Unicode 3.0 and Surrogate Block in XML


Date:       Sat, 18 Sep 1999 00:36:37 -0400 (EST)
From:       Tony Graham <tgraham@mulberrytech.com>
To:         abrahams@acm.org
Cc:         XMLDev list <xml-dev@ic.ac.uk>
Subject:    Re: Unicode surrogate block in XML?

At 17 Sep 1999 22:16 -0400, Paul W. Abrahams wrote:
 > Tony Graham (tgraham@mulberrytech.com)
 > Fri, 17 Sep 1999 01:15:51 -0400 (EST)
 > 
 > >> In any XML document, you can make numeric references to any Unicode
 > character in the range #x10000 to #x10FFFF (as well as to any other
 > legal character number).  These references are independent of the
 > encoding used in the XML document. <<
 > 
 > Is it really correct to refer to #x10FFFF, say, as a Unicode
 > character, since Unicode characters are limited to 16 bits?  I'd think
 > it's necessary here to refer to that as a UCS-4 character.

The Unicode Standard started out with the design principle that all
characters have a uniform width of 16 bits.  The expectation was that
the 65,000 or so characters that you can address with 16 bits would
far exceed the requirements.  However, reality intruded, and the
practicalities (and possibly the political realities) of defining a
universal character set have meant that there are more characters to be
defined than can fit in a 16-bit address space.

Unicode 2.0, published in 1996, defines the Surrogate block and a
mechanism for using two code values from the surrogate block to
address over one million extra characters.
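The surrogate arithmetic can be sketched as follows.  This is a
minimal illustration in Python 3; the function and constant names are
assumptions made for this example and do not come from the Unicode
Standard.

```python
# Boundaries of the high and low halves of the Surrogate block.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def to_surrogate_pair(scalar):
    """Split a character number in #x10000..#x10FFFF into two code values."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only characters outside Plane 0 need surrogates")
    offset = scalar - 0x10000            # a 20-bit quantity
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the character number."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# 1,024 high values times 1,024 low values: "over one million" characters.
PAIR_COUNT = (HIGH_END - HIGH_START + 1) * (LOW_END - LOW_START + 1)

print([hex(v) for v in to_surrogate_pair(0x10000)])  # ['0xd800', '0xdc00']
print(PAIR_COUNT)                                    # 1048576
```

Each half of the pair carries 10 bits, which is why 2,048 reserved
code values are enough to address 1,048,576 extra characters.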

The Unicode Standard, Version 2.0, supports surrogates, but doesn't
quite know what to do about them.  Section 3.7 of the Unicode
Standard, Version 2.0, defines surrogates, and they are mentioned
again in section C.3, but you're left with the impression that they
and UTF-16 are really an ISO/IEC 10646 thing.  UTF-16 was initially
defined in Amendment 1 of ISO/IEC 10646-1:1993, so it wasn't far off
the mark.

Planes 15 and 16 are reserved for private use, so there's been a
legitimate use for surrogates, or, more broadly, for using characters
outside Plane 0, since 1996.

Since 1996, however, there have been numerous proposals for scripts to
be included in the Unicode Standard and ISO/IEC 10646, and many of
these are slated for definition in Plane 1, i.e. they'll need more
than 16 bits to address the characters.  As far as I know, none have
been assigned code values yet, but it won't be too long after the
release of the Unicode Standard, Version 3.0, and ISO/IEC
10646-1:2000.  Furthermore, Plane 2 is reserved as the CJK Unified
Ideographs Supplementary Plane, and it already has 41,000 characters
lined up for inclusion.

 > >> The sequence of #xD800 #xDC00 is the two Surrogate code values that
 > address #x10000.  That four-byte sequence may occur in a UTF-16
 > encoded file to represent #x10000.  In contrast, "&#xD800;&#xDC00;" in
 > an XML document is two illegal character references in a row. <<
 > 
 > I've been trying to fathom the distinction between Unicode and UTF-16,
 > if there is one, and how these in turn relate to the UCS-2 encoding of

There isn't one anymore.  The Unicode Standard used to say that it
corresponded to UCS-2, but now it has embraced UTF-16 (and given us
UTF-16BE and UTF-16LE for big-endian and little-endian representations
without the BOM, respectively).

The Unicode Consortium now also defines UTF-32, which is a 32-bit
representation of the characters that you can address with UTF-16.
There is no difference between the UTF-32 representation of a
character and the UCS-4 representation of a character over the range
of characters that you can address with UTF-32.  The only difference
is that when you say that your document is UTF-32, you're saying that
it comes with the Unicode character semantics and conformance
requirements rather than the different requirements of UCS-4.

UTF-8 has also come into the fold since 1996.  In the Unicode
Standard, Version 2.0, UTF-8 was relegated to section A.2, but now
it's an accepted alternative to UTF-16.

 > ISO 10646.  There's also the question of whether an XML document can
 > be stored directly in Unicode, or whether instead it must be stored in
 > either UTF-8 or UTF-16,  as Section 2.2 seems to imply when it says
 > ``all XML processors must accept the UTF-8 and UTF-16 encodings of
 > 10646''.   The latter appears to be the case; but if it isn't, then
 > how would an XML  document be stored directly in Unicode?   I've

UTF-8 and UTF-16 can encode the characters of the Unicode Standard.

The Unicode Standard used to leave out one aspect of how some people,
e.g. some ISO standards, define a character set.

Roughly speaking, the base aspect is the character repertoire, which is
a collection of abstract characters.

The next aspect is a mapping of the character repertoire onto a set of
numbers.

The third aspect is mapping the character numbers onto some
representation as bits or bytes.

The Unicode Standard used to conflate the second and third aspects
since the character numbers are identical to the value of the 16-bit
quantities that you can use to represent the characters.  Hence it
seems like a Unicode character is its 16-bit character number.  This
simplification falls down when you have character numbers that you
can't express with 16 bits and you allow other bit representations for
the characters.
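The difference between the second and third aspects can be seen by
taking one character number and asking for its representation in
several encodings.  This is only a sketch, assuming Python 3 and its
built-in "utf-8", "utf-16-be" and "utf-32-be" codecs; the helper name
is made up for this example.

```python
def representations(ch):
    """Return the character number and its bytes in three encodings."""
    return {
        "number": hex(ord(ch)),                  # second aspect: the number
        "utf-8": ch.encode("utf-8").hex(),       # third aspect: the bytes
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }


# For a Plane 0 character the UTF-16 bytes look just like the number.
print(representations("A"))
# {'number': '0x41', 'utf-8': '41', 'utf-16': '0041', 'utf-32': '00000041'}

# For #x10000 they do not: UTF-16 needs two 16-bit code values.
print(representations("\U00010000"))
# {'number': '0x10000', 'utf-8': 'f0908080', 'utf-16': 'd800dc00',
#  'utf-32': '00010000'}
```

The character number stays the same in every row; only the byte
representation changes with the encoding.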

You'll find that the Unicode Consortium now speaks about UTF-8,
UTF-16, UTF-32, and UTF-EBCDIC.  The favourite is probably still
UTF-16, but even UTF-16 isn't one 16-bit quantity to one character.

Also, the Unicode character encoding model
(http://www.unicode.org/unicode/reports/tr17/) now has five levels.

 > pondered both Appendix C of the Unicode Standard and the relevant part
 > of the FAQ on the Unicode website, and I'm still unclear about all of
 > this.  (By the way, the FAQ erroneously refers to UTF as the Unicode
 > Transformation Format rather than the UCS transformation format.)

There are two definitions for UTF.  ISO/IEC 10646 always defines it as
"UCS transformation format", and the Unicode Consortium mostly defines
it as "Unicode transformation format" (see section C.3 of the Unicode
Standard, Version 2.0, for an exception).  They mean the same thing.

 > In any event, thanks, Tony, for your very enlightening response to my
 > original query.

I hope this remains enlightening, and not overwhelming.

Regards,


Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1


Date: Fri, 17 Sep 1999 01:15:51 -0400 (EST)
From: Tony Graham <tgraham@mulberrytech.com>
To: XMLDev list <xml-dev@ic.ac.uk>
Subject: Re: Unicode surrogate block in XML?

At 16 Sep 1999 18:12 -0400, Paul W. Abrahams wrote:
 > The XML 1.0 spec explicitly excludes the Unicode surrogate characters
 > from XML documents (production 2).  It now seems, from information
 > I've picked up on the Unicode web site, that surrogate characters are
 > likely to play a more important role in the future, since the
 > available 16-bit characters are almost all used up.  (Unicode 2.0 has
 > 18,134 spares but Unicode 3.0 has only 7827 spares.  The trend is
 > clear.)
 > 
 > Is any thought being given in W3C to allowing surrogate characters in
 > XML documents?

The code values from the Surrogate block (soon to be the High
Surrogates, High Private Use Surrogates, and Low Surrogates) are not
allowed in XML documents, but the characters that you reference with
the two parts of a Surrogate Pair are definitely allowed.

The characters that you can address with a Surrogate Pair are in the
range #x10000 to #x10FFFF.  In Unicode terminology, this is the
Unicode Scalar Value of the Surrogate Pair.

Production 2 from the XML Recommendation shows that these are legal
characters:

[2]  Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
               | [#x10000-#x10FFFF] 
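
Production 2 translates directly into a range test.  The following is
a minimal sketch in Python 3; the function name is an assumption made
for this example, not something defined by the XML Recommendation.

```python
def is_xml_char(cp):
    """True if the character number cp matches production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(is_xml_char(0x10000))  # True: first character outside Plane 0
print(is_xml_char(0xD800))   # False: a Surrogate code value
print(is_xml_char(0xDC00))   # False: a Surrogate code value
```

The gap between #xD7FF and #xE000 is exactly the Surrogate block.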

In a UTF-16 encoded document, you can use the code values from the
Surrogate block to refer to these characters. It would be an error if,
for example, you used an unpaired Surrogate code value, but any UTF-16
application is going to complain about or ignore an unpaired
surrogate.

In a UTF-8 encoded document, you can refer to the characters in the
range #x10000 to #x10FFFF using a four-byte sequence that has no
relationship to the code values in the Surrogate block.
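
As an illustration, the four-byte UTF-8 form can be computed straight
from the character number, without going through surrogates.  This is
a sketch in Python 3 with a made-up function name; it handles only the
range #x10000 to #x10FFFF and is checked against Python's own "utf-8"
codec.

```python
def utf8_four_bytes(scalar):
    """Encode a character number in #x10000..#x10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("this sketch covers only #x10000 to #x10FFFF")
    return bytes([
        0xF0 | (scalar >> 18),            # 11110xxx: top 3 bits
        0x80 | ((scalar >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((scalar >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (scalar & 0x3F),           # 10xxxxxx: last 6 bits
    ])


print(utf8_four_bytes(0x10000).hex())   # f0908080
print(utf8_four_bytes(0x10000) == "\U00010000".encode("utf-8"))  # True
```

None of the four bytes involves #xD800 or #xDC00; the bits come
directly from the character number.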

In UCS-4 (or the new UTF-32) you can directly represent characters in
the range #x10000 to #x10FFFF.

In any XML document, you can make numeric references to any Unicode
character in the range #x10000 to #x10FFFF (as well as to any other
legal character number).  These references are independent of the
encoding used in the XML document.

#x10000 is the first code value outside the Basic Multilingual Plane
(the ISO/IEC 10646 term for the characters in the range #x0 to
#xFFFF).  "&#x10000;" is the hexadecimal numeric reference for this
code value.

The sequence of #xD800 #xDC00 is the two Surrogate code values that
address #x10000.  That four-byte sequence may occur in a UTF-16
encoded file to represent #x10000.  In contrast, "&#xD800;&#xDC00;" in
an XML document is two illegal character references in a row.
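
The difference between the two kinds of reference can be checked with
an XML parser.  This is a sketch, assuming Python 3 and its standard
xml.etree.ElementTree module (backed by expat in CPython); other
parsers should behave the same way but are not tested here, and the
helper name is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A reference to the character itself is legal in any encoding.
element = ET.fromstring("<a>&#x10000;</a>")
print(hex(ord(element.text)))                        # 0x10000

# References to the two Surrogate code values are rejected.
print(is_well_formed("<a>&#xD800;&#xDC00;</a>"))     # False
```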

Regards,


Tony Graham




Date: Thu, 16 Sep 1999 17:37:22 -0700
From: Tim Bray <tbray@textuality.com>
To: "Paul W. Abrahams" <abrahams@valinet.com>,
    XMLDev list <xml-dev@ic.ac.uk>
Subject: Re: Unicode surrogate block in XML?

At 06:12 PM 9/16/99 -0400, Paul W. Abrahams wrote:
>The XML 1.0 spec explicitly excludes the Unicode surrogate characters
>from XML documents (production 2).  It now seems, from information
>I've picked up on the Unicode web site, that surrogate characters are
>likely to play a more important role in the future, since the
>available 16-bit characters are almost all used up.  (Unicode 2.0 has
>18,134 spares but Unicode 3.0 has only 7827 spares.  The trend is
>clear.)

No. Production [2] says

[2] Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF]
              | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

This follows the Unicode model in allowing 17 planes of 64k characters
each, i.e. about a million characters.  For this to work in UTF-16, you
need surrogate pairs.  What XML rules out is *characters* whose numeric 
value is that of one-half of a surrogate pair.  There will never be any
such characters precisely because those values are reserved for use in 
surrogate pairs.  That's why XML rules them out. -Tim
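
The plane arithmetic can be sketched in a few lines.  This is an
illustration in Python 3; the names are assumptions made for this
example.

```python
PLANE_SIZE = 0x10000    # 64k character numbers per plane
PLANE_COUNT = 17        # planes 0 through 16

TOTAL = PLANE_SIZE * PLANE_COUNT


def plane_of(cp):
    """Return the plane number for a character number."""
    return cp >> 16


print(TOTAL)               # 1114112, i.e. about a million
print(plane_of(0xFFFD))    # 0
print(plane_of(0x10000))   # 1
print(plane_of(0x10FFFF))  # 16
```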
