W3C has announced the release of Character Model for the World Wide Web 1.0: Fundamentals as a Final W3C Recommendation. The Consortium also announced support for two other newly issued publications that are critical to increasing the international reach of the World Wide Web. Uniform Resource Identifier (URI): Generic Syntax and Internationalized Resource Identifiers (IRIs) were coordinated through both the Internet Engineering Task Force (IETF) and W3C.
The Character Model for the World Wide Web 1.0: Fundamentals Recommendation is the first of three publications in a series defining the W3C Character Model. The goal of the Character Model for the World Wide Web is "to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way. The model will allow Web documents authored in the world's scripts (and on different platforms) to be exchanged, read, and searched by Web users around the world."
The W3C Character Model Recommendation "allows Web applications to transmit and process the characters of the world's languages. It provides authors of specifications, software developers, and content developers with a common reference for interoperable text manipulation on the World Wide Web, building on the Universal Character Set (UCS), defined jointly by the Unicode Standard and ISO/IEC 10646. Topics addressed include use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, and string indexing."
W3C adopted Unicode "as the document character set for HTML in HTML 4.0. The same approach was later used for Recommendations such as XML 1.0 and CSS Level 2. W3C specifications and applications now use Unicode as the common reference character set."
Other series documents under development include Character Model for the World Wide Web 1.0: Resource Identifiers and Character Model for the World Wide Web 1.0: Normalization. The first is an architectural specification defining a common reference for the use of resource identifiers, and in particular, specifying Internationalized Resource Identifiers (IRIs). The second provides a "common reference for early uniform normalization and string identity matching to improve interoperable text manipulation on the World Wide Web."
W3C has also announced support for the publication of the cooperatively produced IETF RFC 3987 on Internationalized Resource Identifiers (IRIs) as an IETF Proposed Standard, together with IETF STD 66 and RFC 3986, Uniform Resource Identifier (URI): Generic Syntax. URIs are fundamental to the Web: they are "simple text strings that refer to Internet resources — to documents, resources, people, and indirectly to anything. URIs are the glue that binds the Web together. IRIs extend and strengthen the glue, by allowing people to identify Web resources in their own language."
The IRI specification was written by Martin Dürst (W3C) and Michel Suignard (Microsoft) with involvement of the W3C Internationalization Working Group. The IRI document defines a new protocol element named the Internationalized Resource Identifier (IRI) "as a complement to the Uniform Resource Identifier (URI)." URIs are composed of a sequence of characters chosen from a limited subset of the repertoire of US-ASCII characters. The IRI design is motivated by the need to accommodate languages other than English, whose natural scripts use characters beyond 'A-Z', in resource identifiers. IRIs allow characters from the Universal Character Set (Unicode/ISO 10646), so content developers and users can identify resources such as Web pages in their own languages. The IRI specification will also provide a definitive reference for many W3C specifications such as XML, RDF, XHTML and SVG.
A mapping from IRIs to URIs is defined in RFC 3987, "which means that IRIs can be used instead of URIs, where appropriate, to identify resources. The approach of defining a new protocol element was chosen instead of extending or changing the definition of URIs. This was done in order to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines are provided for the use and deployment of IRIs in various protocols, formats, and software components that currently deal with URIs."
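As a rough illustration of what such a mapping involves, the following Python sketch percent-encodes the UTF-8 bytes of every non-ASCII character and leaves ASCII characters untouched. The function name `iri_to_uri` is hypothetical. The sketch assumes the input is already a well-formed IRI held as a Unicode string; it does not validate the IRI, does not apply Unicode normalization, and does not use the alternative IDNA conversion that may be applied to host names.

```python
def iri_to_uri(iri: str) -> str:
    """Illustrative IRI-to-URI mapping (hypothetical helper, not a full implementation).

    Each non-ASCII character is encoded as UTF-8 and every resulting
    byte is written as %HH with uppercase hexadecimal digits.
    ASCII characters, including existing %HH escapes, pass through as-is.
    """
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)  # ASCII: already legal URI material, keep unchanged
        else:
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(pieces)


if __name__ == "__main__":
    # U+00E9 (e with acute accent) is the two UTF-8 bytes C3 A9.
    print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
    # prints http://example.org/r%C3%A9sum%C3%A9
```

Because ASCII input is returned unchanged, a string that is already a URI maps to itself under this sketch.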
The new Uniform Resource Identifier (URI): Generic Syntax specification was written by Tim Berners-Lee (W3C), Roy Fielding (Day Software) and Larry Masinter (Adobe) with involvement of the W3C Technical Architecture Group (TAG). "This new Standard replaces the URI specification released in 1998. Among several technical changes, the host component of a URI is now enabled for internationalized domain names. Other technical changes include a rule for absolute URIs with optional fragments, a rewritten section 6 'Normalization and Comparison' by Tim Bray and the W3C TAG, simplified grammar, clarifications for ambiguities, and revisions to the reserved set of characters."
On February 16, 2005 W3C also announced "the relaunch of the URI Activity. The new URI Interest Group, chaired by Dan Connolly (W3C) and Norman Walsh (Sun Microsystems), is chartered through 28-February-2007. The group reviews ongoing work related to Uniform Resource Identifiers (URIs) and Internationalized Resource Identifiers (IRIs) and helps to deploy quality implementations by maintaining testing materials. Participation is open to W3C Members and the public."
Bibliographic Information and Extracts
Character Model for the World Wide Web 1.0: Fundamentals. W3C Recommendation. 15-February-2005. Edited by Martin J. Dürst (W3C), François Yergeau (Invited Expert), Richard Ishida (W3C), Misha Wolf (Reuters Ltd; until Dec 2002), and Tex Texin (XenCraft, Invited Expert). Version URL: http://www.w3.org/TR/2005/REC-charmod-20050215/. Latest version URL: http://www.w3.org/TR/charmod/. Previous version URL: http://www.w3.org/TR/2004/PR-charmod-20041122/.
"Topics addressed in this part of the Character Model for the World Wide Web include use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, and string indexing. The main target audience of this specification is W3C specification developers. This specification and parts of it can be referenced from other W3C specifications. It defines conformance criteria for W3C specifications as well as other specifications. Other audiences of this specification include software developers, content developers, and authors of specifications outside the W3C. Software developers and content developers implement and use W3C specifications. This specification defines some conformance criteria for implementations (software) and content that implement and use W3C specifications. It also helps software developers and content developers to understand the character-related provisions in W3C specifications. The character model described in this specification provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a more international Web. Topics as yet not addressed or barely touched include fuzzy matching, and language tagging. Some of these topics may be addressed in a future version of this specification..."
Related to Character Model: Fundamentals
- "Character Model for the World Wide Web 1.0: Resource Identifiers." W3C Candidate Recommendation. 22-November-2004.
- "Character Model for the World Wide Web 1.0: Normalization." W3C Working Draft. 25-February-2004.
Internationalized Resource Identifiers (IRIs). By Martin Dürst (World Wide Web Consortium; WWW) and Michel Suignard (Microsoft Corporation; WWW). IETF Network Working Group. Request for Comments #3987. Category: Standards Track. Now a Proposed Standard Protocol. January 2005. 46 pages.
"A Uniform Resource Identifier (URI) is defined in RFC 3986 as a sequence of characters chosen from a limited subset of the repertoire of US-ASCII characters. The characters in URIs are frequently used for representing words of natural languages. This usage has many advantages: Such URIs are easier to memorize, easier to interpret, easier to transcribe, easier to create, and easier to guess. For most languages other than English, however, the natural script uses characters other than A - Z. For many people, handling Latin characters is as difficult as handling the characters of other scripts is for those who use only the Latin alphabet. Many languages with non-Latin scripts are transcribed with Latin letters. These transcriptions are now often used in URIs, but they introduce additional ambiguities. The infrastructure for the appropriate handling of characters from local scripts is now widely deployed in local versions of operating system and application software. Software that can handle a wide variety of scripts and languages at the same time is increasingly common. Also, increasing numbers of protocols and formats can carry a wide range of characters. This document defines a new protocol element called Internationalized Resource Identifier (IRI) by extending the syntax of URIs to a much wider repertoire of characters. It also defines 'internationalized' versions corresponding to other constructs from RFC 3986, such as URI references..."
Uniform Resource Identifier (URI): Generic Syntax. By Tim Berners-Lee (World Wide Web Consortium; WWW), Roy T. Fielding (Day Software; WWW), and Larry Masinter (Adobe Systems Incorporated; WWW). IETF Network Working Group. Request for Comments #3986, and IETF STD 66. Updates RFC 1738, obsoletes RFCs 2732, 2396, 1808. January 2005. 61 pages.
Abstract: "A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. This specification defines the generic URI syntax and a process for resolving URI references that might be in relative form, along with guidelines and security considerations for the use of URIs on the Internet. The URI syntax defines a grammar that is a superset of all valid URIs, allowing an implementation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier. This specification does not define a generative grammar for URIs; that task is performed by the individual specifications of each URI scheme."
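The scheme-independent parsing described in the abstract can be sketched with the regular expression that RFC 3986 gives in its Appendix B for breaking a URI reference into components. The wrapper function `split_uri_reference` and the dictionary keys below are illustrative names, not part of the specification, and the sketch only splits a reference; it does not check that each component is valid.

```python
import re

# Regular expression from RFC 3986, Appendix B. Every part is optional,
# so the pattern matches any string, including the empty string.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(ref: str) -> dict:
    """Split a URI reference into its five generic components.

    A component that is absent is reported as None; the path is
    always present but may be the empty string.
    """
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


if __name__ == "__main__":
    for ref in ("http://www.ics.uci.edu/pub/ietf/uri/#Related",
                "mailto:John.Doe@example.com",
                "../g?y#s"):
        print(ref, "->", split_uri_reference(ref))
```

Nothing in the pattern depends on the scheme, so the same function handles an `http` URI, a `mailto` URI, and a relative reference.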
Background to the Design of the Character Model for the World Wide Web
From the REC section 1.2 Background: "Starting with Internationalization of the Hypertext Markup Language [RFC 2070], the Web community has recognized the need for a character model for the World Wide Web. The first step towards building this model was the adoption of Unicode as the document character set for HTML.
The choice of Unicode was motivated by the fact that Unicode:
- is the only universal character repertoire available
- provides a way of referencing characters independent of the encoding of the text
- is being updated/completed carefully
- is widely accepted and implemented by industry
W3C adopted Unicode as the document character set for HTML in HTML 4.0. The same approach was later used for specifications such as XML 1.0 and CSS2. W3C specifications and applications now use Unicode as the common reference character set.
When data transfer on the Web remained mostly unidirectional (from server to browser), and where the main purpose was to render documents, the use of Unicode without specifying additional details was sufficient. However, the Web has grown:
- Data transfers among servers, proxies, and clients, in all directions, have increased
- Characters outside the US-ASCII [ISO/IEC 646][MIME-charset] repertoire are being used in more and more places
- Data transfers between different protocol/format elements (such as element/attribute names, URI components, and textual content) have increased
- More and more APIs are defined, not just protocols and formats
In short, the Web may be seen as a single, very large application, rather than as a collection of small independent applications. While these developments strengthen the requirement that Unicode be the basis of a character model for the Web, they also create the need for additional specifications on the application of Unicode to the Web. Some aspects of Unicode that require additional specification for the Web include:
- Choice of Unicode encoding forms (UTF-8, UTF-16, UTF-32)
- Counting characters, measuring string length in the presence of variable-length character encodings and combining characters
- Duplicate encodings of characters (e.g., precomposed vs decomposed)
- Use of control codes for various purposes (e.g., bidirectionality control, symmetric swapping, etc.)
It should be noted that such aspects also exist in various encodings, and in many cases have been inherited by Unicode in one way or another from these encodings..."
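Two of the issues listed above, string length under variable-length encodings and duplicate encodings of the same character, can be made concrete with a short Python sketch. The helper names `text_lengths` and `same_text` are hypothetical, and NFC is used here only as one possible normalization form for the comparison, not as the procedure the Character Model itself prescribes.

```python
import unicodedata


def text_lengths(s: str) -> dict:
    """Report the 'length' of a string under several counting rules."""
    return {
        "code_points": len(s),                                # Python str counts code points
        "utf8_bytes": len(s.encode("utf-8")),                 # 1 to 4 bytes per code point
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,  # surrogate pairs count twice
        "utf32_bytes": len(s.encode("utf-32-le")),            # always 4 bytes per code point
    }


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (an assumption of this sketch)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
    print(text_lengths(precomposed))             # 1 code point, 2 UTF-8 bytes
    print(text_lengths(decomposed))              # 2 code points, 3 UTF-8 bytes
    print(precomposed == decomposed)             # False: different code point sequences
    print(same_text(precomposed, decomposed))    # True: identical after NFC
```

The two strings render identically yet differ in every length measure, which is why a shared definition of string indexing and string identity matters for interoperability.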
From the W3C Announcements
The World Wide Web Consortium (W3C) has published the Character Model of the World Wide Web: Fundamentals as a W3C Recommendation. It provides a well-defined and well-understood way for Web applications to transmit and process the characters of the world's languages.
This architectural Recommendation gives authors of specifications, software developers, and content developers a common reference, enabling interoperable text manipulation on the World Wide Web. It builds on the Universal Character Set, defined jointly by the Unicode Standard and ISO/IEC 10646. Topics include use of the terms 'character', 'encoding' and 'string', a reference processing model, choice and identification of character encodings, character escaping, and string indexing.
The goal of the Character Model for the World Wide Web is to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access.
Unicode Brings the Universal Character Set to the Web
At the core of the character model is the Universal Character Set (UCS). The model allows Web technologies to support text in the world's scripts (and on different platforms), so that it can be exchanged, read, and searched by Web users around the world. Unicode was chosen because it provides a way of referencing characters independent of the encoding of the text, it is being updated and completed carefully, and it is widely accepted and implemented by industry.
W3C adopted Unicode as the document character set for HTML in HTML 4.0. The same approach was later used for Recommendations such as XML 1.0 and CSS Level 2. W3C specifications and applications now use Unicode as the common reference character set.
New Specification Clarifies Character Usage on the Web
As the number of Web applications increases, the need for a shared character model has become more critical. Unicode is the natural choice as the basis for that shared model, especially as applications developers begin to consolidate their encoding options. However, applying Unicode to the Web requires additional specifications; this is the purpose of the W3C Character Model series.
Series Documents to Be Completed in 2005
Today's Recommendation is the first in a set of three documents. In development are "Character Model for the World Wide Web 1.0: Normalization," specifying early uniform normalization and string identity matching for text manipulation, and "Character Model for the World Wide Web 1.0: Resource Identifiers," specifying IRI conventions.
Industry Leaders Key in Development of Character Model Series
The Character Model was developed by the W3C Internationalization Activity's Working Group (now the W3C Internationalization Core Working Group) with the help of the W3C Internationalization Interest Group. W3C Members participating in the Working Group include BBC, Boeing, Ecole Mohammadia d'Ingénieurs, IBM, Microsoft, Siemens, Sun Microsystems, and webMethods.
URIs and Internationalized Resource Identifiers (IRIs)
The World Wide Web Consortium (W3C) announces its support for two newly issued publications that are critical to increasing the international reach of the World Wide Web. These publications, coordinated through both the IETF and W3C, are RFC 3986, STD 66 Uniform Resource Identifier (URI): Generic Syntax and RFC 3987 Internationalized Resource Identifiers (IRIs), respectively an Internet Engineering Task Force (IETF) Internet Standard and Proposed Standard.
URIs and IRIs Are the Glue That Holds the Web Together
The World Wide Web is defined as the universal, all-encompassing space containing all Internet — and other — resources referenced by Uniform Resource Identifiers (URIs, sometimes commonly called "URLs").
In Tim Berners-Lee's original proposal, and in the initial Web implementation, the Web consisted of relatively few technologies, including the Hypertext Transfer Protocol (HTTP) and the HyperText Markup Language (HTML). Yet perhaps more fundamental than either HTTP or HTML are URIs, which are simple text strings that refer to Internet resources: documents, resources, people, and indirectly to anything. URIs are the glue that binds the Web together. IRIs extend and strengthen the glue, by allowing people to identify Web resources in their own language.
The IETF Internet Standards Process has produced thousands of publications, including approximately 60 Internet Standards. The URI specification is joining this small group. An Internet Standard (or "Standard") has a high degree of technical maturity and is believed to provide significant benefit to the Internet community. The newer of the two documents, the IRI specification, has been published as a Proposed Standard.
Fundamental Component of the Web Updated
Uniform Resource Identifier (URI): Generic Syntax was written by Tim Berners-Lee (Director, W3C), Roy Fielding (Day Software) and Larry Masinter (Adobe Systems) with involvement of the W3C Technical Architecture Group (TAG). The Standard describes the design, syntax, and resolution of URIs as well as security considerations and normalization and comparison (determining if two URIs are equivalent).
This new Standard replaces the URI specification released in 1998. Among several technical changes, the host component of a URI is now enabled for internationalized domain names. Other technical changes include a rule for absolute URIs with optional fragments, a rewritten section 6 "Normalization and Comparison" by Tim Bray and the W3C TAG, simplified grammar, clarifications for ambiguities, and revisions to the reserved set of characters.
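To give a feel for what normalization and comparison mean in practice, here is a minimal Python sketch of two syntax-based normalizations: case normalization (lowercase scheme and host, uppercase hexadecimal digits in percent-encodings) and percent-encoding normalization (decoding escapes that stand for unreserved characters). The function names are hypothetical, and the sketch deliberately omits other steps such as removal of dot segments and scheme-specific rules, so it can report two URIs as different even when a fuller comparison would treat them as equivalent.

```python
import re

_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")
# Scheme (with its colon) and an optional "//authority" at the start of a URI.
_PREFIX = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*:)(//[^/?#]*)?")


def _fix_pct(m: re.Match) -> str:
    """Decode escapes of unreserved characters; uppercase the hex of all others."""
    ch = chr(int(m.group(1), 16))
    return ch if ch in _UNRESERVED else "%" + m.group(1).upper()


def normalize_uri(uri: str) -> str:
    """Apply case and percent-encoding normalization (partial sketch only)."""
    uri = _PCT.sub(_fix_pct, uri)
    m = _PREFIX.match(uri)
    if not m:
        return uri  # no scheme: treated as a relative reference, left as-is
    scheme, authority = m.group(1), m.group(2) or ""
    # Userinfo is case-sensitive, so only the host[:port] part is lowercased.
    userinfo, at, hostport = authority.rpartition("@")
    hostport = _PCT.sub(lambda t: t.group(0).upper(), hostport.lower())
    return scheme.lower() + userinfo + at + hostport + uri[m.end():]


def equivalent(a: str, b: str) -> bool:
    """Compare two URIs after the partial normalization above."""
    return normalize_uri(a) == normalize_uri(b)


if __name__ == "__main__":
    print(normalize_uri("HTTP://Example.COM/%7euser/a%2fb"))
    # prints http://example.com/~user/a%2Fb
```

Note that `%2f` is only uppercased, not decoded: the slash is a reserved character, and decoding it would change how the path is divided into segments.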
IRIs Allow Internationalized Web Addressing
The Internationalized Resource Identifiers (IRIs) Proposed Standard was developed in part by the W3C Internationalization Working Group, and was written by Martin Dürst (W3C) and Michel Suignard (Microsoft Corporation).
With few exceptions, the natural scripts of the world's languages use characters other than A-Z. By expanding allowed characters from a subset of US-ASCII to the Universal Character Set (Unicode/ISO 10646), IRIs allow content developers and users to identify resources in their own languages. In addition, many W3C specifications — such as XML, RDF, XHTML and SVG — needed a definitive reference for identifiers that support international characters. The IRI specification provides that critical reference.
According to the IRI specification, every URI is already an IRI. As a result, URI users do not need to do anything differently in order to find what they need on the Web. The specification also discusses how to convert an IRI to a URI for resolution on existing systems, the special case of bidirectional IRIs, equivalence between IRIs, IRI use in different situations, security considerations and informative guidelines.
IETF and W3C Cooperation Produces Strong Results
These IETF documents are good examples of the longstanding cooperation between IETF and W3C.
Along with the HTTP specification, the URI specifications pre-date W3C, and are among the earliest documented Web work. As these specifications continue to be useful to many IETF efforts, their standardization continued within the IETF. The W3C URI Activity hosts discussion forums and provides editing resources and coordinates with other W3C Activities on Web technologies.
Principal references:
- Announcements:
- Announcement February 15, 2005: "World Wide Web Consortium Issues Critical Internationalization Recommendation. 'Character Model of the World Wide Web: Fundamentals' Brings Unified Approach to Using Characters on the Web." [source]
- W3C news item: Character Model
- Announcement January 26, 2005: "World Wide Web Consortium Supports the IETF URI Standard and IRI Proposed Standard. URI Specification Updated, IRIs Allow Internationalized Web Addressing." [source]
- W3C news item: URI Standard and IRI Proposed Standard
- RFC 3987 on Internationalized Resource Identifiers (IRIs). IETF announcement January 25, 2005. Now a Proposed Standard Protocol.
- STD 66, RFC 3986 on Uniform Resource Identifier (URI): Generic Syntax. IETF announcement January 25, 2005. Now a Standard Protocol.
- Character Model for the World Wide Web 1.0: Fundamentals. W3C Recommendation.
- Character Model for the World Wide Web 1.0: Resource Identifiers
- Character Model for the World Wide Web 1.0: Normalization
- Uniform Resource Identifier (URI): Generic Syntax. IETF Network Working Group, Request for Comments #3986, and IETF STD 66. [source IETF]
- Internationalized Resource Identifiers (IRIs). IETF Network Working Group, Request for Comments #3987. [source IETF]
- An Introduction to Multilingual Web Addresses
- Uniform Resource Identifier (URI) Activity Statement
- W3C URI Home Page. Naming and Addressing: URIs, URLs...
- W3C Launches URI Interest Group
- W3C I18N:
- W3C Internationalization (I18N) Activity
- Internationalization Activity Statement
- W3C Internationalization Publications
- W3C Web Internationalization FAQs
- Internationalization (I18N) Working Group Review RADAR
- W3C Internationalization Core Working Group Home Page
- Internationalization Core Working Group Charter
- W3C Internationalization Interest Group:
- W3C Internationalization Interest Group Charter
- Internationalization Interest Group mailing list archives. See the description. Subscribe to the list by sending email to www-international-request@w3.org with subscribe in the 'Subject: ' line; post to www-international@w3.org.
- Contact: Richard Ishida (Team Contact, W3C Internationalization Working Group).
- Earlier news:
- "W3C Forms Internationalization Tag Set Working Group with Rechartered I18N Activity."
- "W3C I18N Working Group Publishes Last Call Working Draft for the WWW Character Model."
- "IESG Announces Last Call Review for IETF Internet Drafts on URIs and IRIs."
- "W3C Publishes Draft Guidelines for Authoring Internationalized XHTML and HTML."
- "W3C Index of Translations Showcases RDF and Internationalization Technologies."
- Press:
- "W3C, IETF Stick with 'Web Glue' Standards." By Clint Boulton. From Internetnews.com (January 26, 2005).
- "W3C Shows Why Innovation Needs Standards." From ZDNet UK (January 27, 2005).
- General references: