The W3C Internationalization Working Group has issued a second Last Call Working Draft of the Character Model for the World Wide Web 1.0 specification. The document is an Architectural Specification designed to provide "a common reference for interoperable text manipulation on the World Wide Web. Topics addressed include encoding identification, early uniform normalization, string identity matching, string indexing, and URI conventions, building on the Universal Character Set, defined jointly by Unicode and ISO/IEC 10646. Some introductory material on characters and character encodings is also provided." The goal of the specification is to "facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access; one basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way."
The W3C I18N Working Group invites comments on the specification through the end of the review period, May 31, 2002. "Due to the architectural nature of this document, it affects a large number of W3C Working Groups, but also software developers, content developers, and writers and users of specifications outside the W3C that have to interface with W3C specifications. Because review comments play an important role in ensuring a high quality specification, the WG encourages readers to review this Last Call Working Draft carefully."
Bibliographic information: Character Model for the World Wide Web 1.0. W3C Working Draft 30-April-2002. Edited by Martin J. Dürst (W3C), François Yergeau (Alis Technologies), Richard Ishida (Xerox Global Services), Misha Wolf (Reuters Ltd.), Asmus Freytag (ASMUS, Inc.), and Tex Texin (Progress Software Corp.). Version URL: http://www.w3.org/TR/2002/WD-charmod-20020430. Latest version URL: http://www.w3.org/TR/charmod. Previous version URL: http://www.w3.org/TR/2002/WD-charmod-20020220. Also in XML and ZIP archive formats.
From the I18N activity statement: "The Character Model gives directions and guidelines for other W3C Recommendations, to make sure that the internationalization features of the various W3C specifications fit together, and that the common base of the Web architecture is clearly defined. A core point of the W3C Character Model is the understanding that more and more the Web as a whole has to be seen as a single application. The W3C Character Model therefore assumes that any kind of text processing is done using the Universal Character Set (UCS), defined jointly by The Unicode Standard and ISO/IEC 10646. When text data is actually transmitted over a network, various character encodings may be used, but it always has to be made clear what character encoding is used. For consistent behavior on the Web, in some cases, additional specifications associated with ISO 10646/Unicode are needed. One example is the way that characters which may consist of more than one element (such as accented characters) are transmitted. For example, a character like é might be sent as an 'e' character and a separate acute accent, or as a single character. On the Web, one of these methods of sending the character should be considered the 'normal' form, so that processing software can deal more predictably with these characters. The Character Model specifies the 'normal' form, based on work by the Unicode Consortium. For the example above, using a single character is the normal form. The character model also discusses how to count characters in various circumstances, and defines a consistent and backwards-compatible way to include non-ASCII characters in URIs..."
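The é example in the activity statement can be illustrated with a short Python sketch. It assumes that the 'normal' form referred to is Unicode Normalization Form C (NFC), which composes 'e' plus a combining acute accent into a single character. The helper names `to_normal_form` and `count_units` are hypothetical and chosen only for this illustration; the second one shows why "how to count characters" depends on the unit being counted.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (two code points)
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (one code point)
precomposed = "\u00e9"


def to_normal_form(text: str) -> str:
    """Return text in NFC (assumed here to be the 'normal' form)."""
    return unicodedata.normalize("NFC", text)


def count_units(text: str) -> dict:
    """Count the same string in three different units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


if __name__ == "__main__":
    # The two spellings look identical but do not compare equal as-is.
    print(decomposed == precomposed)
    # After normalization they are the same string.
    print(to_normal_form(decomposed) == precomposed)
    # The counts differ by unit and by whether the text was normalized.
    print(count_units(decomposed))
    print(count_units(precomposed))
```

Normalizing text early, before it is stored or compared, is what lets a plain string comparison serve as an identity test later on.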
From the document Introduction:
All W3C specifications must conform to this document; see section 2 on 'Conformance'. Authors of other specifications, for example, IETF specifications, are strongly encouraged to take guidance from it.
The main target audience of this document is W3C specification developers. This document defines conformance requirements for other W3C specifications. This document and parts of it can also be referenced from other W3C specifications.
Other audiences of this document include software developers, content developers, and authors of specifications outside the W3C. Software developers and content developers implement and use W3C specifications. This document defines some conformance requirements for software developers and content developers that implement and use W3C specifications. It also helps software developers and content developers to understand the character-related provisions in other W3C specifications.
The character model described in this document provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a more international Web.
Topics addressed include encoding identification, early uniform normalization, string identity matching, string indexing, and URI conventions. Some introductory material on characters and character encodings is also provided. Topics not addressed or barely touched include collation (sorting), fuzzy matching and language tagging. Some of these topics may be addressed in a future version of this specification.
At the core of the model is the Universal Character Set (UCS), defined jointly by The Unicode Standard and ISO/IEC 10646. In this document, Unicode is used as a synonym for the Universal Character Set. The model will allow Web documents authored in the world's scripts (and on different platforms) to be exchanged, read, and searched by Web users around the world.
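Two of the topics above, encoding identification and URI conventions, can be sketched in Python. The sketch assumes that a receiver decodes bytes only with an explicitly declared encoding, and that non-ASCII characters in a URI are first encoded as UTF-8 and then percent-escaped byte by byte. The function names and the sample path are hypothetical and not taken from the specification.

```python
from urllib.parse import quote, unquote


def decode_with_declared_encoding(data: bytes, declared_encoding: str) -> str:
    """Decode bytes to a Unicode string using the declared encoding only."""
    return data.decode(declared_encoding)


def escape_for_uri(path: str) -> str:
    """Encode non-ASCII characters as UTF-8, then percent-escape each byte."""
    return quote(path, safe="/", encoding="utf-8")


if __name__ == "__main__":
    text = "caf\u00e9"

    # The same text travels as different byte sequences in different encodings.
    latin1_bytes = text.encode("iso-8859-1")
    utf8_bytes = text.encode("utf-8")
    print(latin1_bytes != utf8_bytes)

    # With the encoding identified, both decode to the same Unicode string.
    print(decode_with_declared_encoding(latin1_bytes, "iso-8859-1") == text)
    print(decode_with_declared_encoding(utf8_bytes, "utf-8") == text)

    # Non-ASCII characters in a URI path, escaped via UTF-8.
    escaped = escape_for_uri("/menu/" + text)
    print(escaped)
    print(unquote(escaped) == "/menu/" + text)
```

Because every encoding is mapped onto the same Unicode string, processing after the decoding step does not need to know which encoding was used on the wire.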