A revised version of Unicode in XML and other Markup Languages has been published as Unicode Technical Report #20, Revision 7 and as W3C Note 13-June-2003. This revision reflects three principal changes: (1) the base version of the Unicode Standard for this document is now Unicode Version 4.0, which adds 1,226 new character assignments; (2) greater prominence is given to material in a new Section 3, "Characters Not Suitable for Use With Markup"; (3) a new Section 6 clarifies the appropriate uses of the 66 non-character code points, or Unicode noncharacters. Section 3 discusses characters "which are unsuitable in the context of markup in XML/HTML and whose use is discouraged for one or more reasons; for example, they are deprecated in the Unicode Standard, they are unsupportable without additional data, they are difficult to handle because they are stateful, they are better handled by markup, or because of conflict with equivalent markup." For each character class in question, the Technical Report provides a short description of semantics, the reason the characters were included in Unicode, clarification of the specific problems when they are used with markup, related areas where problems may occur (e.g., in plain text), what kind of markup to use instead of the Unicode characters, and what different classes of software should do if the problematic characters are detected in a particular context.
Bibliographic Information
Unicode in XML and other Markup Languages. By Martin Dürst (W3C) and Asmus Freytag (Technical Vice President, Unicode Consortium). Joint Publication by the Unicode Technical Committee and the W3C Internationalization Working Group and Interest Group. Publication date: 2003-06-13 (13 June 2003).
- Unicode reference: Unicode Technical Report #20, Revision 7. Version URL: http://www.unicode.org/reports/tr20/tr20-7.html. Latest version URL: http://www.unicode.org/reports/tr20/. Previous version URL: http://www.unicode.org/reports/tr20/tr20-6.html.
- W3C reference: W3C Note 13 June 2003. Version URL: http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/. Latest version URL: http://www.w3.org/TR/unicode-xml/. Previous version URL: http://www.w3.org/TR/2002/NOTE-unicode-xml-20020218/.
Characters not Suitable for Use With Markup
Subsection 3.1 in the Revision 6 document has been promoted to a top-level document section "Characters not Suitable for Use With Markup" in Revision 7 of Unicode Technical Report #20:
3.1 Table of Characters not Suitable for Use With Markup. A table of "characters currently considered not suitable for use with markup in XML or HTML. They may also be unsuitable for other markup or page layout languages. For determining possible conflict this report uses the markup available in HTML."
3.2 Line and Paragraph Separator. "The line and paragraph separator provide unambiguous means to denote hard line breaks and paragraph delimiters in plain text... Including these characters in markup text does not work where it would duplicate the existing markup commands for delimiting paragraphs and lines... The separator characters can also be problematic when used in plain text, because legacy data is usually converted code point for code point into Unicode and all receivers of Unicode plain text have to effectively be able to interpret the existing use of control codes for this purpose. As a result, fewer Unicode implementations support these characters, than would be the case otherwise..."
3.3 Bidi Embedding Controls. "The bidi embedding controls are required to supplement the Unicode Bidirectional Algorithm in plain text... These characters duplicate available markup, which is better suited to handle the stateful nature of their effect... The embedding controls introduce a state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF..."
3.4 Deprecated Formatting Characters. "These characters were retained from draft versions of ISO 10646, originally intended to allow explicit activation of contextual shaping, numeric digit rendering and symmetric swapping... The most likely effect of their occurrence in generated text would be that of a 'garbage' character... When received by a browser as part of marked up text, they may be ignored. When received in an editing context, they may be removed, possibly with a warning..."
3.5 Byte Order Mark. "U+FEFF has two functions. It is formally known as zero width no-break space (ZWNBSP), and can act as a word joiner, but its primary use is as byte order mark (BOM), to indicate in a file signature that a file is in a Unicode encoding form and of a particular byte order... [It was] originally included in Unicode for the sole purpose of indicating byte order or use in file signatures, the character acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and Unicode... Using U+FEFF as ZWNBSP makes it impossible to distinguish it from the case where a byte order mark was left in the middle of a file inadvertently due to incorrect splicing..."
3.6 Interlinear Annotation Characters. "The interlinear annotation characters are used to delimit interlinear annotations in certain circumstances. They are intended to provide text anchors and delimiters for interlinear annotation for in-process use and are not intended for interchange... [they] were included in Unicode only in order to reserve code points for very frequent application-internal use. The interlinear annotation characters are used to delimit interlinear annotations in contexts where other delimiters are not available, and where non-textual means exist to carry formatting information. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. This is called out-of-band information... Including interlinear annotation characters in marked-up text does not work because the additional formatting information (how to position the annotation,...) is not available... The interlinear annotation characters are also problematic when used in plain text, and are not intended for that purpose. In particular, on older display systems that simply ignore or replace the Interlinear Annotation Characters, the meaning of the text may be changed..."
3.7 Object Replacement Character. "The object replacement character is used to stand in place of an object (e.g., an image) included in a text... The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use... Including an object replacement character in markup text does not work because the additional information (what object to include,...) is not available... The object replacement character is also problematic when used in plain text, because there is no way in plain text to provide the actual object information or a reference to it..."
3.8 Musical Controls. Unicode U+1D173...U+1D17A is a "series of characters for controlling scope in musical notation... These characters designate the start and end of common musical constructs. Full musical layout depends on additional information, for example pitch, that cannot be encoded using Unicode... These characters duplicate information that can in principle be expressed in markup... Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters..."
3.9 Language Tag Characters. "Unicode Language Tag Characters U+E0000...U+E007F represent a series of characters for expressing language tags, based on existing standards for language tags using the rules in Chapter 15 of the [Unicode Standard]... These characters allow in-band language tagging in situations where full markup is not available, while allowing easy filtering by applications that do not support them. They were solely included for the benefit of those Internet protocols, such as ACAP, which require a standard mechanism for marking language in UTF-8 strings, and at the same time to avoid the use of other tagging schemes that relied on specific details of the encoding form used... [However,] these characters duplicate information that can be expressed in markup... Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters... Replace [them] with equivalent language markup. XML and XHTML have the xml:lang attribute. HTML has the lang attribute. These attributes follow different scoping rules than the tag characters, therefore this replacement will generally not be a simple 1:1 substitution..."
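The character classes in subsections 3.2 through 3.9 can be detected by code point range, as the following minimal Python sketch illustrates. Only the ranges for U+FEFF, the musical controls, and the language tag characters are stated in the excerpts above; the remaining ranges are taken from the Unicode code charts and should be checked against the table in Section 3.1 of the report. The names `UNSUITABLE_RANGES`, `classify`, and `find_unsuitable` are hypothetical, and skipping a U+FEFF at index 0 (on the assumption that it is a file signature) is a design choice of this sketch, not something the report prescribes.

```python
# Inclusive code point ranges for the character classes of Section 3.
# Assumption: ranges other than U+FEFF, U+1D173..U+1D17A and
# U+E0000..U+E007F come from the Unicode code charts, not from the
# excerpts quoted in this article.
UNSUITABLE_RANGES = [
    (0x2028, 0x2029, "line/paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated formatting character"),
    (0xFEFF, 0xFEFF, "byte order mark / ZWNBSP"),
    (0xFFF9, 0xFFFB, "interlinear annotation character"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical control"),
    (0xE0000, 0xE007F, "language tag character"),
]


def classify(ch):
    """Return the Section 3 class label for one character, or None."""
    cp = ord(ch)
    for low, high, label in UNSUITABLE_RANGES:
        if low <= cp <= high:
            return label
    return None


def find_unsuitable(text):
    """List (index, 'U+XXXX', label) for each flagged character in text.

    A U+FEFF at index 0 is treated as a file signature and not reported
    (an assumption of this sketch).
    """
    hits = []
    for index, ch in enumerate(text):
        label = classify(ch)
        if label is None:
            continue
        if index == 0 and ch == "\ufeff":
            continue
        hits.append((index, f"U+{ord(ch):04X}", label))
    return hits
```

A checker built along these lines would only report the characters; what to do next (ignore, remove with a warning, or replace with markup) depends on the character class and on the kind of software, as the report describes.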
Noncharacters
A new Section 6 on "Noncharacters" was added in Revision 7 of Unicode Technical Report #20:
"The Unicode Standard defines 66 non-character code points, or noncharacters. These are the last two positions on each of the 17 planes, in other words, all characters whose code points end in ...FFFE or ...FFFF, as well as the 32 code points from U+FDD0 to U+FDEF. Applications are free to use any of these code points internally but should never attempt to interchange them. In effect, noncharacters can be thought of as application-internal private-use code points."
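The definition quoted above translates directly into a code point test. The sketch below is illustrative only; the function name `is_noncharacter` is hypothetical and not part of any standard library. It also confirms the arithmetic: two code points on each of 17 planes (34) plus the 32 code points from U+FDD0 to U+FDEF gives 66.

```python
def is_noncharacter(cp):
    """Return True if the integer code point cp is a Unicode noncharacter."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two positions of every plane: low 16 bits are FFFE or FFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Count noncharacters across the whole code space (17 planes).
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```

An application following the report's advice could use such a test at its interchange boundary, for example to reject or strip these code points before writing XML.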
Changes in the Unicode Standard Version 4.0
"1,226 new character assignments were made to the Unicode Standard Version 4.0 over and above what was in Unicode 3.2. These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants). Double diacritic characters were added for dictionary use.
Major additions to Unicode Version 4.0 since Version 3.0 include:
- major changes to the introductory and conformance chapters, and extensive revisions to the discussion of punctuation, symbols, and format characters
- extensive additions of CJK characters to cover dictionaries and historic usage
- many new symbols for mathematical and technical publication
- many individual characters such as currency symbols were added to other scripts, including Indic, Khmer, Latin, Greek, Arabic, and Syriac
- substantially improved specification of conformance requirements, incorporating the character encoding model
- encoding of supplementary characters
- formalized policies for stability of the standard
- clarification of semantics of special characters, including the byte order mark
- major expansion of Unicode Character Database properties and of specifications for text boundaries and casing
- more minority scripts, including Limbu, Tai Le, Osmanya, and Philippine scripts
- more historic scripts, including Linear B, Cypriot, and Ugaritic
- tightened definition of encoding terms, including UTF-32
- substantial improvements to the script descriptions, particularly for Indic scripts and Khmer" [adapted from the overview document]
Principal references:
- Unicode in XML and other Markup Languages. Unicode Technical Report #20, Revision 7.
- Unicode in XML and other Markup Languages. W3C Note 13-June-2003.
- The Unicode Homepage
- The Unicode Consortium website
- Unicode Technical Committee (UTC) website
- Unicode Version 4.0
- Versions of the Unicode Standard
- Unicode Technical Reports
- W3C Internationalization Activity
- W3C Internationalization Activity Statement
- World Wide Web Consortium (W3C) website
- W3C Technical Reports and Publications
- "XML and Unicode" - Main reference page.
- "Topical References: Markup and Multilingualism" - Main reference page.