First draft of proposed XML TC for Unicode 3.0


Date:     Tue, 7 Sep 1999 17:44:16 -0400 (EDT)
From:     John Cowan <cowan@locke.ccil.org>
To:       xml-dev@ic.ac.uk
Subject:  First draft of proposed XML TC for Unicode 3.0 (unofficial)


This is version 0.1 of a proposed technical corrigendum to XML 1.0
to incorporate the new characters of Unicode 3.0 into the allowable
sets used in XML Names.  It presumes that XML should not
remain limited to an obsolete version of the Unicode and ISO 10646
standards.

The new scripts handled are:
Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Ogham, Runic, Syriac,
Thaana, Unified Canadian Aboriginal Syllabics, Yi.

These lists of new characters were constructed by using the current Unicode 3.0
data file from the Unicode Consortium and applying the rules given
in Appendix B to it.  This version of the proposal does not
yet incorporate information from the Unicode 3.0 properties list.

(Unicode 3.0 is technically still in beta, but the character list has
been frozen for months now.)
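
To illustrate how lists like the ones below can be derived, here is a minimal Python sketch of the general-category and decomposition tests from Appendix B, applied to a single record of the Unicode data file. It is not the script used for this proposal. The function name classify is hypothetical, and the per-character exceptions that Appendix B lists individually (for example the exclusion of #x20DD-#x20E0) are deliberately left out.

```python
# General categories that Appendix B of XML 1.0 accepts for name-start
# characters and for the remaining name characters.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
OTHER_NAME_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}


def classify(record):
    """Classify one semicolon-separated data file record.

    Returns "name-start", "name", or None if the character fails the
    general tests. Hypothetical helper: the individually listed
    exceptions in Appendix B are NOT applied here.
    """
    fields = record.strip().split(";")
    code = int(fields[0], 16)      # field 0: code point in hex
    category = fields[2]           # field 2: general category
    decomposition = fields[5]      # field 5: decomposition mapping

    # Characters in the compatibility area are not allowed.
    if 0xF900 < code < 0xFFFE:
        return None
    # A leading "<" marks a font or compatibility decomposition.
    if decomposition.startswith("<"):
        return None
    if category in NAME_START_CATEGORIES:
        return "name-start"
    if category in OTHER_NAME_CATEGORIES:
        return "name"
    return None


# Sample record in the data file's field layout.
print(classify("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"))
```

Running classify over every record of the data file and collecting the code points by result would give raw candidate sets, which would still have to be compared against the existing XML 1.0 productions to find what is new.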

New BaseChars (BNF rule 85):

[#x01F6-#x01F9] /* new Latin letters */
| [#x0218-#x021F]
| [#x0222-#x0233]
| [#x02A9-#x02AD] /* new IPA Latin letters */
| #x03D7 /* new Greek letters */
| #x03DB
| #x03DD
| #x03DF
| #x03E1
| #x0400 /* new Cyrillic letters */
| #x040D
| #x0450
| #x045D
| [#x048C-#x048F]
| [#x04EC-#x04ED]
| [#x06B8-#x06B9] /* new Arabic letters */
| #x06BF
| #x06CF
| [#x06FA-#x06FC]
| #x0710 /* new Syriac script */
| [#x0712-#x072C]
| [#x0780-#x07A5] /* new Thaana script */
| #x0950 /* OM letters */
| #x0AD0
| [#x0D85-#x0D96] /* new Sinhala script */
| [#x0D9A-#x0DB1]
| [#x0DB3-#x0DBB]
| #x0DBD
| [#x0DC0-#x0DC6]
| #x0E2F /* new Thai characters */
| #x0EAF
| #x0F00 /* Tibetan OM */
| #x0F6A /* new Tibetan letters */
| [#x1000-#x1021] /* new Myanmar script */
| [#x1023-#x1027]
| [#x1029-#x102A]
| [#x1050-#x1055]
| #x1101 /* Hangul jamo that are no longer compatibility characters */
| #x1104
| #x1108
| #x110A
| #x110D
| [#x1113-#x113B]
| #x113D
| #x113F
| [#x1141-#x114B]
| #x114D
| #x114F
| [#x1151-#x1153]
| [#x1156-#x1158]
| #x1162
| #x1164
| #x1166
| #x1168
| [#x116A-#x116C]
| [#x116F-#x1171]
| #x1174
| [#x1176-#x119D]
| [#x119F-#x11A2]
| [#x11A9-#x11AA]
| [#x11AC-#x11AD]
| [#x11B0-#x11B6]
| #x11B9
| #x11BB
| [#x11C3-#x11EA]
| [#x11EC-#x11EF]
| [#x11F1-#x11F8]
| [#x1200-#x1206] /* new Ethiopic script */
| [#x1208-#x1246]
| #x1248
| [#x124A-#x124D]
| [#x1250-#x1256]
| #x1258
| [#x125A-#x125D]
| [#x1260-#x1286]
| #x1288
| [#x128A-#x128D]
| [#x1290-#x12AE]
| #x12B0
| [#x12B2-#x12B5]
| [#x12B8-#x12BE]
| #x12C0
| [#x12C2-#x12C5]
| [#x12C8-#x12CE]
| [#x12D0-#x12D6]
| [#x12D8-#x12EE]
| [#x12F0-#x130E]
| #x1310
| [#x1312-#x1315]
| [#x1318-#x131E]
| [#x1320-#x1346]
| [#x1348-#x135A]
| [#x13A0-#x13F4] /* new Cherokee script */
| [#x1401-#x166C] /* new Canadian Syllabics script */
| [#x166F-#x1676]
| [#x1681-#x169A] /* new Ogham script */
| [#x16A0-#x16EA] /* new Runic script */
| [#x1780-#x17B3] /* new Khmer script */
| [#x1820-#x1842] /* new Mongolian script */
| [#x1844-#x1877]
| [#x1880-#x18A8]
| #x3006 /* Ideographic closing mark */
| [#x31A0-#x31B7] /* new Bopomofo letters */
| [#xA000-#xA48C] /* new Yi script */

IMHO none of these are controversial except perhaps the Hangul jamo.
Formerly, some Hangul jamo had compatibility decompositions into
sequences of other Hangul jamo.  These decompositions have been
removed from the Unicode Standard (actually in 2.1), so the jamo
should now be allowed in XML names in accordance with the rules in Appendix B.
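
For anyone who wants to test code points against these lists mechanically, the following Python sketch parses the range notation used in this message into (low, high) pairs and checks membership. The names parse_ranges and contains are hypothetical, and SAMPLE holds only a few entries copied from the BaseChar list above, not the whole list.

```python
import re

# Either a range "[#xHHHH-#xHHHH]" or a single code point "#xHHHH".
_TOKEN = re.compile(r"\[#x([0-9A-Fa-f]+)-#x([0-9A-Fa-f]+)\]|#x([0-9A-Fa-f]+)")
# The /* ... */ comments that annotate the lists.
_COMMENT = re.compile(r"/\*.*?\*/", re.S)


def parse_ranges(production):
    """Return sorted (low, high) pairs for the notation used in this message."""
    text = _COMMENT.sub("", production)
    ranges = []
    for low, high, single in _TOKEN.findall(text):
        if single:
            value = int(single, 16)
            ranges.append((value, value))
        else:
            ranges.append((int(low, 16), int(high, 16)))
    return sorted(ranges)


def contains(ranges, code_point):
    """True if code_point falls inside any of the (low, high) pairs."""
    return any(low <= code_point <= high for low, high in ranges)


# A few entries copied from the BaseChar list, for illustration only.
SAMPLE = """
[#x13A0-#x13F4] /* new Cherokee script */
| [#x1401-#x166C] /* new Canadian Syllabics script */
| [#x166F-#x1676]
| #x3006 /* Ideographic closing mark */
"""

ranges = parse_ranges(SAMPLE)
print(contains(ranges, 0x13A0))  # first Cherokee letter in the sample
print(contains(ranges, 0x166D))  # falls in the gap between two ranges
```

A linear scan is enough for lists of this size. A binary search over the sorted pairs would be the obvious refinement for a real parser.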

New Ideographics (BNF rule 86):

[#x3400-#x4DB5] /* CJK Ideograph Extension A */

New CombiningChars (BNF rule 87):

[#x0346-#x034E] /* new IPA combining characters */
| #x0362
| [#x0488-#x0489] /* new Cyrillic combining characters */
| [#x0653-#x0655] /* new Arabic combining characters */
| #x0711 /* combining characters for new Syriac script */
| [#x0730-#x074A]
| [#x07A6-#x07B0] /* combining characters for new Thaana script */
| [#x0D82-#x0D83] /* combining characters for new Sinhala script */
| #x0DCA
| [#x0DCF-#x0DD4]
| #x0DD6
| [#x0DD8-#x0DDF]
| [#x0DF2-#x0DF3]
| #x0F96 /* new Tibetan subjoined letters */
| [#x0FAE-#x0FB0]
| #x0FB8
| [#x0FBA-#x0FBC]
| #x0FC6 /* new Tibetan combining character */
| [#x102C-#x1032] /* combining characters for new Myanmar script */
| [#x1036-#x1039]
| [#x1056-#x1059]
| [#x17B4-#x17D3] /* combining characters for new Khmer script */
| #x18A9 /* combining character for new Mongolian script */
| [#x20E2-#x20E3] /* new general combining characters */

IMHO none of these are controversial except perhaps #x20E2 and #x20E3,
which are primarily intended for use with symbol characters, and therefore
should perhaps be excluded, as #x20DD-#x20E0 already are.

New Digits (BNF rule 88):

[#x1040-#x1049] /* digits for new Myanmar script */
| [#x1369-#x1371] /* digits for new Ethiopic script */
| [#x17E0-#x17E9] /* digits for new Khmer script */
| [#x1810-#x1819] /* digits for new Mongolian script */

IMHO none of these will be controversial.

New Extenders (BNF rule 89):

#x02EE /* Modifier letter double apostrophe */
| #x1843 /* Modifier letter for new Mongolian script */

IMHO none of these will be controversial.

In addition, the following characters no longer pass the tests given
in Appendix B for valid name or name-start characters, but should
remain legal in XML names for backward compatibility, and therefore
should be explicitly enumerated in the corrigendum:

03D0;GREEK BETA SYMBOL
03D1;GREEK THETA SYMBOL
03D2;GREEK UPSILON WITH HOOK SYMBOL
03D5;GREEK PHI SYMBOL
03D6;GREEK PI SYMBOL
03F0;GREEK KAPPA SYMBOL
03F1;GREEK RHO SYMBOL
03F2;GREEK LUNATE SIGMA SYMBOL
0675;ARABIC LETTER HIGH HAMZA ALEF
0676;ARABIC LETTER HIGH HAMZA WAW
0677;ARABIC LETTER U WITH HAMZA ABOVE
0678;ARABIC LETTER HIGH HAMZA YEH
0E33;THAI CHARACTER SARA AM
0EB3;LAO VOWEL SIGN AM
0F77;TIBETAN VOWEL SIGN VOCALIC RR
0F79;TIBETAN VOWEL SIGN VOCALIC LL
1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING
212E;ESTIMATED SYMBOL
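
One way to picture the effect of enumerating these characters explicitly is sketched below in Python: a character stays legal if it either passes the rule-based test or appears in the enumerated list. The names parse_code_list and allowed_in_name are hypothetical, the rule-based test is passed in as a placeholder function, and GRANDFATHERED_LINES holds only three entries copied from the list above.

```python
def parse_code_list(text):
    """Map code point -> character name for lines of the form 'HHHH;NAME'."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        code, name = line.split(";", 1)
        table[int(code, 16)] = name.strip()
    return table


def allowed_in_name(code_point, passes_rules, grandfathered):
    """Legal if the rule-based test accepts it or it is explicitly listed.

    passes_rules is a placeholder for whatever implements the
    Appendix B tests; it is an assumption of this sketch.
    """
    return passes_rules(code_point) or code_point in grandfathered


# Three entries copied from the list above, for illustration only.
GRANDFATHERED_LINES = """\
03D0;GREEK BETA SYMBOL
0E33;THAI CHARACTER SARA AM
212E;ESTIMATED SYMBOL
"""

grandfathered = parse_code_list(GRANDFATHERED_LINES)


def rejects_everything(code_point):
    """Stand-in rule test that accepts nothing, to isolate the list's effect."""
    return False


print(allowed_in_name(0x03D0, rejects_everything, grandfathered))
```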

###


John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1


Prepared by Robin Cover for The SGML/XML Web Page archive.



Document URL: http://xml.coverpages.org/cowanUnicodeTC.html