The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: August 18, 2004.

IESG Announces Last Call Review for IETF Internet Drafts on URIs and IRIs.

Recent 'Last Call' postings from the Internet Engineering Steering Group announce the IESG's intention to make decisions on the approval of two URI-related IETF Internet Drafts within the next few weeks.

The "Internationalized Resource Identifiers (IRIs)" draft is being considered for approval as an IETF Proposed Standard. The IESG solicits final comments on the proposed action by 2004-09-08. "Uniform Resource Identifier (URI): Generic Syntax" is under consideration for approval as an IETF Full Standard. Comments on this Internet Draft are invited through 2004-09-13.

The IRI document defines a new protocol element named the Internationalized Resource Identifier (IRI) "as a complement to the Uniform Resource Identifier (URI)." URIs are composed of a sequence of characters chosen from a limited subset of the repertoire of US-ASCII characters. The IRI design is motivated by a need to accommodate non-English languages in which natural scripts use characters other than simply 'A-Z' to compose URIs.

An IRI is a "sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs where appropriate to identify resources. The approach of defining a new protocol element was chosen, instead of extending or changing the definition of URIs, to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines for the use and deployment of IRIs in various protocols, formats, and software components that now deal with URIs are provided."

The World Wide Web, according to W3C's Uniform Resource Identifier (URI) Activity Statement, is a "universal, all-encompassing space containing all Internet resources referenced by Uniform Resource Identifier (URI). The Web is dominated today by relatively few technologies, including the HyperText Transfer Protocol (HTTP) and the HyperText Markup Language (HTML). Perhaps more fundamental than either HTTP or HTML are the URIs, which are simple text strings that refer to Internet resources. URIs may refer to documents, resources, to people, and indirectly to anything. Document formats and protocols may come and go, but URIs will remain as the glue that binds the Web together."

The new URI Generic Syntax document, if approved, will update IETF RFC 1738 and obsolete RFCs 2732, 2396, and 1808. A Uniform Resource Identifier (URI) is "a compact sequence of characters for identifying an abstract or physical resource." The Generic Syntax specification "defines the generic URI syntax and a process for resolving URI references that might be in relative form, along with guidelines and security considerations for the use of URIs on the Internet. The URI syntax defines a grammar that is a superset of all valid URIs, such that an implementation can parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier. The specification does not define a generative grammar for URIs; that task is performed by the individual specifications of each URI scheme."
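As an illustration of scheme-independent parsing and relative reference resolution, here is a minimal Python sketch. The regular expression is the one commonly cited from the parsing appendix of the URI specification; the article itself does not quote it, so treat its use here as an assumption. The function name split_uri_reference and the sample URIs are hypothetical, and urljoin from the Python standard library stands in for the resolution process the draft describes.

```python
import re
from urllib.parse import urljoin

# Generic pattern that splits any URI reference into its five common
# components without knowing anything about the scheme.
# Assumption: taken from the URI specification's parsing appendix.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri_reference(ref):
    """Return the five generic components; absent ones are None."""
    m = URI_REFERENCE.match(ref)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)

# Resolving a reference in relative form against a base URI.
resolved = urljoin("http://a/b/c/d;p?q", "../g")
print(resolved)  # http://a/b/g
```

The same function splits a reference whose scheme it has never seen, which is the point of a generic grammar: scheme-specific validation is left to each scheme's own specification.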

Appendix D of the Generic Syntax I-D provides a summary of non-editorial changes. "IPv6 (and later) literals have been added to the list of possible identifiers for the host portion of an authority component. Section 6 on URI normalization and comparison has been completely rewritten and extended using input from Tim Bray and discussion within the W3C Technical Architecture Group (TAG)."
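The following Python sketch shows an IPv6 literal in the host portion of an authority and a simple normalization applied before comparison. The article does not say which normalization steps the rewritten Section 6 covers, so the steps chosen here (lowercasing the scheme and host, uppercasing percent-encoding hex digits, and decoding percent-encoded unreserved characters) are assumptions. The function names and sample URIs are hypothetical, and dot-segment removal is deliberately left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# An IPv6 literal is enclosed in square brackets inside the authority.
parsed = urlsplit("http://[2001:DB8::7]:8080/index.html")
print(parsed.hostname, parsed.port)  # 2001:db8::7 8080

# Characters that never need percent-encoding (assumed "unreserved" set).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def _normalize_escapes(text):
    """Decode escapes of unreserved characters; uppercase the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)

def normalize(uri):
    """Simplified syntax-based normalization (sketch, not exhaustive)."""
    parts = urlsplit(uri)
    # Only the host and port are lowercased; userinfo may be case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    return urlunsplit((
        parts.scheme.lower(),
        userinfo + at + hostport.lower(),
        _normalize_escapes(parts.path),
        _normalize_escapes(parts.query),
        _normalize_escapes(parts.fragment),
    ))

a = normalize("HTTP://Example.COM/%7euser/a%2fb")
b = normalize("http://example.com/~user/a%2Fb")
print(a == b)  # True
```

Comparing the normalized strings rather than the raw ones reduces false negatives, at the cost of the extra processing that normalization requires.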

Bibliographic Information

Overview and Motivation for IRIs

"A Uniform Resource Identifier (URI) is defined in [IETF RFC] as a sequence of characters chosen from a limited subset of the repertoire of US-ASCII characters.

The characters in URIs are frequently used for representing words of natural languages. Such usage has many advantages: such URIs are easier to memorize, easier to interpret, easier to transcribe, easier to create, and easier to guess. For most languages other than English, however, the natural script uses characters other than A-Z. For many people, handling Latin characters is as difficult as handling the characters of other scripts is for people who use only the Latin alphabet. Many languages with non-Latin scripts have transcriptions to Latin letters. Such transcriptions are now often used in URIs, but they introduce additional ambiguities.

The infrastructure for the appropriate handling of characters from local scripts is now widely deployed in local versions of operating system and application software. Software that can handle a wide variety of scripts and languages at the same time is increasingly widespread. Also, there are increasing numbers of protocols and formats that can carry a wide range of characters.

[The IRI Internet Draft] document defines a new protocol element, called Internationalized Resource Identifier (IRI), by extending the syntax of URIs to a much wider repertoire of characters. It also defines 'internationalized' versions corresponding to other constructs from [the URI RFC], such as URI references...

IRIs are designed to be compatible with recent recommendations for new URI schemes. The compatibility is provided by specifying a well defined and deterministic mapping from the IRI character sequence to the functionally equivalent URI character sequence. Practical use of IRIs (or IRI references) in place of URIs (or URI references) depends on the following conditions being met:

  • The protocol or format element where IRIs are used should be explicitly designated to be able to carry IRIs. That is, the intent is not to introduce IRIs into contexts that are not defined to accept them. For example, XML Schema has an explicit type 'anyURI' that includes IRIs and IRI references. Therefore, IRIs and IRI references can be in attributes and elements of type 'anyURI'. On the other hand, in the HTTP protocol [RFC 2616], the Request URI is defined as a URI, which means that direct use of IRIs is not allowed in HTTP requests.
  • The protocol or format carrying the IRIs should have a mechanism to represent the wide range of characters used in IRIs, either natively or by some protocol- or format-specific escaping mechanism (for example numeric character references in XML)...
  • The URI corresponding to the IRI in question has to encode original characters into octets using UTF-8. For new URI schemes, this is recommended in [RFC 2718]. It can apply to a whole scheme (e.g., IMAP URLs [RFC 2192] and POP URLs [RFC 2384], or the URN syntax of RFC 2141). It can apply to a specific part of a URI, such as the fragment identifier (e.g., [XPointer]). It can apply to a specific URI or part(s) thereof..." [from the IRI draft]
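The third condition above can be sketched in a few lines of Python: each character outside US-ASCII is encoded as UTF-8 and every resulting octet is percent-encoded, while ASCII characters pass through unchanged. This is a simplified illustration under stated assumptions, not the draft's full mapping procedure. The function name iri_to_uri and the sample IRI are hypothetical, and the sketch performs no Unicode normalization and applies no special handling to host names.

```python
def iri_to_uri(iri):
    """Map an IRI to a URI by percent-encoding the UTF-8 octets of
    every non-ASCII character (simplified sketch)."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # US-ASCII characters are copied as they are.
            out.append(ch)
        else:
            # Encode the character as UTF-8, then escape each octet as %HH.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)

uri = iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html")
print(uri)  # http://www.example.org/r%C3%A9sum%C3%A9.html
```

Because the mapping is deterministic, a component that only accepts URIs, such as an HTTP request line, can be handed the converted form while the IRI itself is kept in formats that are defined to carry it.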

Overview of URIs

"URIs are characterized as follows:

  • Uniform: Uniformity provides several benefits: it allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ; it allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers; it allows introduction of new types of resource identifiers without interfering with the way that existing identifiers are used; and, it allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a pre-existing, large, and widely-used set of resource identifiers.

  • Resource: This specification does not limit the scope of what might be a resource; rather, the term 'resource' is used in a general sense for whatever might be identified by a URI. Familiar examples include an electronic document, an image, a source of information with consistent purpose (e.g., 'today's weather report for Los Angeles'), a service (e.g., an HTTP to SMS gateway), a collection of other resources, and so on. A resource is not necessarily accessible via the Internet; e.g., human beings, corporations, and bound books in a library can also be resources. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, the types of a relationship (e.g., 'parent' or 'employee'), or numeric values (e.g., zero, one, and infinity).

  • Identifier: An identifier embodies the information required to distinguish what is being identified from all other things within its scope of identification. Our use of the terms 'identify' and 'identifying' refer to this purpose of distinguishing one resource from all other resources, regardless of how that purpose is accomplished (e.g., by name, address, context, etc.). These terms should not be mistaken as an assumption that an identifier defines or embodies the identity of what is referenced, though that may be the case for some identifiers. Nor should it be assumed that a system using URIs will access the resource identified: in many cases, URIs are used to denote resources without any intention that they be accessed. Likewise, the 'one' resource identified might not be singular in nature (e.g., a resource might be a named set or a mapping that varies over time)...

A URI is an identifier, consisting of a sequence of characters matching the syntax rule named <URI> in Section 3, that enables uniform identification of resources via a separately defined, extensible set of naming schemes. How that identification is accomplished, assigned, or enabled is delegated to each scheme specification. This specification does not place any limits on the nature of a resource, the reasons why an application might wish to refer to a resource, or the kinds of system that might use URIs for the sake of identifying resources. This specification does not require that a URI persists in identifying the same resource over all time, though that is a common goal of all URI schemes. Nevertheless, nothing in this specification prevents an application from limiting itself to particular types of resources, or to a subset of URIs that maintains characteristics desired by that application.

URIs have a global scope and are interpreted consistently regardless of context, though the result of that interpretation may be in relation to the end-user's context. For example, http://localhost/ has the same interpretation for every user of that reference, even though the network interface corresponding to 'localhost' may be different for each end-user: interpretation is independent of access. However, an action made on the basis of that reference will take place in relation to the end-user's context, which implies that an action intended to refer to a single, globally unique thing must use a URI that distinguishes that resource from all other things. URIs that identify in relation to the end-user's local context should only be used when the context itself is a defining aspect of the resource, such as when an on-line help manual refers to a file on the end-user's filesystem..." [from the I-D proposed as RFC]

Hosted By
OASIS - Organization for the Advancement of Structured Information Standards

Robin Cover, Editor