Update: http://xml.coverpages.org/draft-langtags-phillips-davis-00.txt
From: http://www.ietf.org/internet-drafts/draft-langtags-phillips-davis-00.txt (ephemeral URL)
Title: Tags for Languages
Reference: draft-langtags-phillips-davis-00.txt
See also: http://xml.coverpages.org/languageIdentifiers.html Language Identifiers in the Markup Context

------------------------------------------------------------------------

INTERNET-DRAFT
Category: Network
Addison Phillips, Mark Davis
Title: Tags for Languages

Status of this Memo

This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at: http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at: http://www.ietf.org/shadow.html.

This document is an individual contribution for consideration by the Network Working Group of the Internet Engineering Task Force. Comments should be submitted to the ietf-languages@alvestrand.no list.

This document expires in November 2004.

Abstract

This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags, including user defined extensions for private interchange.

1. Introduction

Human beings on our planet have, past and present, used a number of languages.
There are many reasons why one would want to identify the language used when presenting information.

In some contexts, it is possible to have information available in more than one language, or it might be possible to provide tools (such as dictionaries) to assist in the understanding of a language.

Also, many types of information processing require knowledge of the language in which information is expressed in order for that process to be performed on the information; for example spell-checking, computer-synthesized speech, Braille, or high-quality print renderings.

Phillips/Davis INTERNET-DRAFT [Page 1] Tags for Identification of Languages November 2003

One means of indicating the language used is by labeling the information content with an identifier for the language that is used in this information content.

These labels can also be used to specify user preferences when selecting information content, or for labeling additional attributes of content.

In particular, it is often necessary to define additional specific information about the dialect, writing system, or orthography used in a document, as this information may be useful to specialists or may be important in understanding the structure of the information and ways in which it should be processed.

This document specifies an identifier mechanism, a registration function for values to be used with that identifier mechanism, and a construct for matching against those values. It also defines a mechanism for private use extension and how private use, registered values, and matching interact.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119].

2. The Language tag

2.1 Language tag syntax

The language tag is composed of one or more parts: A primary language subtag and a (possibly empty) series of subsequent subtags.
The subtags have a specific structure that depends on the length of the subtag to distinguish each tag type.

The syntax of this tag in ABNF [RFC 2234] is:

   language_tag = lang ["-" script] ["-" region] *("-" variant) [extensions]
               =/ "x" *("-" alphanum)
               =/ grandfathered_registrations
   lang = shortest_alpha_ISO_639_code
       =/ registered_lang
   script = ISO_15924_code
   region = shortest_alpha_ISO_3166_code
         =/ registered_region
   variant = year
          =/ registered_variant
   registered_lang = 5*16 alphanum
   registered_variant = 5*16 alphanum
   extensions = "-x" 1* ("-" key "=" value)
   value = * alphanum
   alphanum = (ALPHA / DIGIT)
   year = * DIGIT

The character "-" is HYPHEN-MINUS (ABNF: %x2D). The character "=" is EQUALS SIGN (ABNF: %x3D).

All tags are to be treated as case insensitive: there exist conventions for the capitalization of some of them, but these should not be taken to carry meaning. For instance, [ISO 3166] recommends that country codes are capitalized (MN Mongolia), while [ISO 639] recommends that language codes are written in lower case (mn Mongolian).

2.2 Language tag sources

The namespace of language tags is administered by the Internet Assigned Numbers Authority (IANA) [RFC 2860] according to the rules in section 3 of this document.

All of the rules in this section apply to the various subtags within language tags defined in this document, excepting those "grandfathered" tags defined in section 2.2.1 of this document. Those tags should be considered as atomic exceptions to the rules presented here.

The following rules apply to the primary (language) subtag:

- All 2-letter language subtags are interpreted according to assignments found in ISO standard 639, "Code for the representation of names of languages" [ISO 639], or assignments subsequently made by the ISO 639 Part 1 maintenance agency or governing standardization bodies.
- All 3-letter language subtags are interpreted according to assignments found in ISO 639 part 2, "Codes for the representation of names of languages -- Part 2: Alpha-3 code" [ISO 639-2], or assignments subsequently made by the ISO 639 part 2 maintenance agency or governing standardization bodies.

- ISO 639 reserves for private use tags in the range 'qaa' through 'qtz'. These tags should be used for non-registered base language values.

- Additional subtags of 5 to 16 letters may be registered with IANA, according to the rules in section 3 of this document. Registered tags must not begin with the letter 'x', which is reserved for private use tags. (Note that previously, in RFC 3066, the IANA registry contained whole tag registrations such as 'de-CH-1994', whereas this document refers to the registration of subtags such as 'klingon'.)

- The single letter "x" indicates a complete private use language tag. Users of private use subtags may define the semantics of the tags in any way that they desire. As such, the semantic meaning should be considered external to this document.

- Other values shall not be assigned except by revision of this standard.

The following rules apply to the script subtags:

- All 4-letter subtags are interpreted as ISO 15924 alpha-4 script codes from [ISO 15924], or subsequently assigned by the ISO 15924 maintenance agency or governing standardization bodies, denoting the script or writing system used in conjunction with this language. These alpha-4 tags may only occur as the second subtag in a tag. For example: 'de-Latn' represents German written using Latin script.
The following rules apply to the region subtags:

- All 2-letter and 3-letter subtags are interpreted as ISO 3166 alpha-2 (or alpha-3) country codes from [ISO 3166], or subsequently assigned by the ISO 3166 maintenance agency or governing standardization bodies, denoting the area to which this language variant relates. Region tags must occur after any script tags and before any variants or extensions. The shortest form tag must be used. At the time this document was written, all alpha-3 codes had a corresponding alpha-2 code. Use of alpha-3 codes is provided for a foreseeable future in which alpha-2 codes have been exhausted. Example: 'de-Latn-CH' represents German written using Latin script for Switzerland.

- ISO 3166 reserves the country codes AA, QM-QZ, XA-XZ and ZZ (plus any three-letter sequences starting with these codes) as user-assigned codes. These tags should be used for non-registered private use region values.

The following rules apply to the variant subtags:

- Additional subtags of 5 to 16 letters may be registered with IANA, according to the rules in section 3 of this document. Registered tags must not begin with the letter 'x', which is reserved for private use tags. (Note that previously, in RFC 3066, the IANA registry contained whole tag registrations such as 'en-boont', whereas this document refers to the registration of subtags such as 'boont'.)

- Additional subtags of 5 to 16 letters starting with 'x' are reserved for private use. The semantics of these tags must be defined by the end users of such subtags and the semantic meaning should be considered external to this document.

The following rules apply to the extensions:

- No source is defined for extensions. External agreement or standardization of extension subtags is by private agreement and should not be considered part of this document.
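As a rough illustration of how the length and position rules above separate the subtag types, the following Python sketch labels each subtag of a tag. The function name classify_subtags and the label strings are hypothetical and are not defined by this document. The sketch checks shape only: it does not consult ISO 639, ISO 15924, ISO 3166, or the IANA registry, it does not recognize grandfathered tags, it does not enforce that regions precede variants, and a registered region such as 'CS2003' would be labeled by its length as a variant.

```python
def classify_subtags(tag):
    """Label each subtag of a language tag by its length and position.

    Hypothetical helper: shape checks only, no registry lookups.
    """
    if not tag.isascii():
        raise ValueError("language tags use ASCII characters only")
    subtags = tag.lower().split("-")      # tags are case insensitive
    primary, rest = subtags[0], subtags[1:]

    if primary == "x":
        # A first subtag of "x" marks the whole tag as private use.
        return [(s, "private-use") for s in subtags]

    if primary.isalpha() and len(primary) in (2, 3):
        labels = [(primary, "language")]
    elif primary.isalnum() and 5 <= len(primary) <= 16:
        labels = [(primary, "registered-language")]
    else:
        raise ValueError(f"unrecognized primary subtag: {primary!r}")

    for position, subtag in enumerate(rest, start=2):
        if subtag == "x":
            # "-x" opens the extensions; no source is defined for them.
            labels.append((subtag, "extension-marker"))
            labels.extend((s, "extension") for s in rest[position - 1:])
            break
        if len(subtag) == 4 and subtag.isalpha():
            if position != 2:
                raise ValueError("a script code may only be the second subtag")
            labels.append((subtag, "script"))
        elif len(subtag) in (2, 3) and subtag.isalpha():
            labels.append((subtag, "region"))
        elif subtag.isdigit():
            labels.append((subtag, "variant"))          # year variant
        elif subtag.isalnum() and 5 <= len(subtag) <= 16:
            kind = "private-use-variant" if subtag.startswith("x") else "variant"
            labels.append((subtag, kind))
        else:
            raise ValueError(f"unrecognized subtag: {subtag!r}")
    return labels


print(classify_subtags("de-Latn-CH"))
print(classify_subtags("de-CH-x-collation=phonebook"))
```

Because the subtag types are distinguished by length alone, the sketch needs no lookahead: a 4-letter subtag in second position can only be a script, and a 2- or 3-letter subtag after the primary subtag can only be a region.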
2.2.1 Pre-Existing RFC 3066 Tag Registrations

Existing IANA registered language tags from RFC 1766/RFC 3066 that are not defined by additions to this document maintain their validity. IANA will maintain these tags, adding a notation that they are "grandfathered from RFC 3066".

The rules governing existing RFC 1766 and RFC 3066 registered tags are:

- If the registered tag would now be defined by this document, then the existing tag is marked as superseded by this document and no subtag will be registered as a result. For example, 'zh-Hans' would now be defined by the addition of ISO 15924 script codes.

- If the registered tag contains one or more subtags that follow the guidelines for registered language or variant subtags, and all of the subtags are either now defined by this document or would be valid to register, then each subtag not already covered by this document will be registered automatically by IANA without further review and the existing tag marked as superseded by this document. For example: the tag 'en-boont' fits the pattern for a registered variant. The variant subtag 'boont' will be registered automatically and 'en-boont' marked as superseded.

- If the registered tag contains any subtags that are not otherwise valid for registration according to the rules in this document, then the tag as a whole is maintained as an exceptional case. This includes special cases of Sign Language tags. For example, the tag 'i-klingon' is not covered by any addition and is grandfathered, as is 'sgn-US-MA' (Martha's Vineyard Sign Language, which is found in the state of Massachusetts, USA).

Users of tags that are grandfathered should consider registering tags using the new format (but are not required to).

2.2.2 Possibilities for registration consideration include:

- Languages not listed in ISO 639 that are not variants of any listed language can be registered with the "x" prefix, such as xtsolyani.
Before attempting to register a language tag, there should be a good faith attempt to register the language with ISO 639. No language tags will be registered for tags that exist in ISO 639-1 or ISO 639-2.

- Dialect or other divisions or variations within a language, its orthography, writing system, regional variation, or historical usage, such as en-scouse (the Scouse dialect of English).

This document leaves the decision on what tags are appropriate or not to the registration process described in section 3.

ISO 639 defines a maintenance agency for additions to and changes in the list of languages in ISO 639. This agency is:

   International Information Centre for Terminology (Infoterm)
   P.O. Box 130
   A-1021 Wien
   Austria
   Phone: +43 1 26 75 35 Ext. 312
   Fax: +43 1 216 32 72

ISO 639-2 defines a maintenance agency for additions to and changes in the list of languages in ISO 639-2. This agency is:

   Library of Congress
   Network Development and MARC Standards Office
   Washington, D.C. 20540
   USA
   Phone: +1 202 707 6237
   Fax: +1 202 707 0115
   URL: http://www.loc.gov/standards/iso639

The maintenance agency for ISO 3166 (country codes) is:

   ISO 3166 Maintenance Agency Secretariat
   c/o DIN Deutsches Institut fuer Normung
   Burggrafenstrasse 6
   Postfach 1107
   D-10787 Berlin
   Germany
   Phone: +49 30 26 01 320
   Fax: +49 30 26 01 231
   URL: http://www.din.de/gremien/nas/nabd/iso3166ma/

The registration authority for ISO 15924 (script codes) is:

   Unicode Consortium
   Box 391476, Mountain View, CA 94039-1476, USA
   URL: http://www.unicode.org/iso15924

2.3 Choice of language tag

One may occasionally be faced with several possible tags for the same body of text.

Interoperability is best served if all users send the same tag, and use the same tag for the same language for all documents.
If an application has requirements that make the rules here inapplicable, the application protocol specification MUST specify how the procedure varies from the one given here.

The text below is based on the set of tags known to the tagging entity.

1. Use the most precise tag known to the sender that can be ascertained and is useful within the application context.

2. When a language has both an ISO 639-1 2-character code and an ISO 639-2 3-character code, you MUST use the ISO 639-1 2-character code.

3. When a language has no ISO 639-1 2-character code, and the ISO 639-2/T (Terminology) code and the ISO 639-2/B (Bibliographic) codes differ, you MUST use the Terminology code. NOTE: At present all languages that have both kinds of 3-character code also are assigned a 2-character code, and the displeasure of developers about the existence of two different code sets has been adequately communicated to ISO. So this situation will hopefully not arise.

4. You SHOULD NOT use the UND (Undetermined) code unless the protocol in use forces you to give a value for the language tag, even if the language is unknown. Omitting the tag is preferred.

5. You SHOULD NOT use the MUL (Multiple) tag if the protocol allows you to use multiple languages, as is the case for the Content-Language header in HTTP.

NOTE: In order to avoid versioning difficulties in applications such as that experienced in RFC 1766, the ISO 639 Registration Authority Joint Advisory Committee (RA-JAC) has agreed on the following policy statement:

   "After the publication of ISO/DIS 639-1 as an International Standard, no new 2-letter code shall be added to ISO 639-1 unless a 3-letter code is also added at the same time to ISO 639-2. In addition, no language with a 3-letter code available at the time of publication of ISO 639-1 which at that time had no 2-letter code shall be subsequently given a 2-letter code."
This will ensure that, for example, a user who implements "haw" (Hawaiian), which currently has no 2-letter code, will not find his or her data invalidated by eventual addition of a 2-letter code for that language.

6. To maintain backwards compatibility, there are two provisions to account for instabilities in ISO 639, 3166, and 15924 codes.

   a. Ambiguity. In the event that one of these standards reassigns a code that was previously assigned to a different value, the new use of the code will not be permitted and the IANA registry, as soon as practical, will register a surrogate value for the new code, based on the year that the new code assignment was made. For example:

      cs-CS
      sr-CS2003

   b. Stability. All other ISO codes are valid, even if they have been deprecated. At the time of this writing, this includes the following list. Where a new equivalent code has been defined (given below on the right side after a tilde), implementations should treat these tags as identical.

      Deprecated ISO 639 codes
         iw ~ he
         in ~ id
         ji ~ yi

      Deprecated ISO 3166 codes
         FX
         TP ~ TL
         YU

2.4 Meaning of the language tag

The language tag always defines a language as spoken (or written, signed or otherwise signaled) by human beings for communication of information to other human beings. Computer languages such as programming languages are explicitly excluded.

If a language tag B contains language tag A as a prefix, then B should typically be "narrower" or "more specific" than A. However, this relationship is not guaranteed. There is also no guaranteed relationship between languages whose tags begin with the same series of subtags; specifically, they are NOT guaranteed to be mutually intelligible, although they may be.

The relationship between the tag and the information it relates to is defined by the standard describing the context in which it appears.
Accordingly, this section can only give possible examples of its usage.

- For a single information object, it could be taken as the set of languages that is required for a complete comprehension of the complete object. Example: Plain text documents.

- For an aggregation of information objects, it should be taken as the set of languages used inside components of that aggregation. Examples: Document stores and libraries.

- For information objects whose purpose is to provide alternatives, the set of tags associated with it should be regarded as a hint that the content is provided in several languages, and that one has to inspect each of the alternatives in order to find its language or languages. In this case, a tag with multiple languages does not mean that one needs to be multi-lingual to get complete understanding of the document. Example: MIME multipart/alternative.

- In markup languages, such as HTML and XML, language information can be added to each part of the document identified by the markup structure (including the whole document itself). For example, one could write <span lang="FR">C'est la vie.</span> inside a Norwegian document; the Norwegian-speaking user could then access a French-Norwegian dictionary to find out what the marked section meant. If the user were listening to that document through a speech synthesis interface, this information could be used to signal the synthesizer to appropriately apply French text-to-speech pronunciation rules to that span of text, instead of misapplying the Norwegian rules.

2.5 Language-range

Since the publication of RFC 3066, it has become apparent that there is a need to define a term for a set of languages whose tags all begin with the same sequence of subtags.

The following definition of language-range is derived from HTTP/1.1 [RFC 2616].
   language-range = language-tag / "*"

That is, a language-range has the same syntax as a language-tag, or is the single character "*".

A language-range matches a language-tag if it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first character following the prefix is "-".

The special range "*" matches any tag. A protocol which uses language ranges may specify additional rules about the semantics of "*"; for instance, HTTP/1.1 specifies that the range "*" matches only languages not matched by any other range within an "Accept-Language:" header.

NOTE: This use of a prefix matching rule does not imply that language tags are assigned to languages in such a way that it is always true that if a user understands a language with a certain tag, then this user will also understand all languages with tags for which this tag is a prefix. The prefix rule simply allows the use of prefix tags if this is the case.

3. IANA registration procedure for language tags

The procedure given here MUST be used by anyone who wants to use a subtag not given an interpretation in section 2.2 of this document or previously registered with IANA.

This procedure MAY also be used to register information with the IANA about a tag or subtag defined by this document, for instance if one wishes to make publicly available a reference to the definition for a language such as sgn-US (American Sign Language).

Tags with a first subtag of "x" cannot be registered as they are reserved for private use.

The process starts by filling out the registration form reproduced below.
----------------------------------------------------------------------
LANGUAGE SUBTAG REGISTRATION FORM

Name of requester:
E-mail address of requester:
Subtag to be registered:
Whether subtag is a lang subtag or a variant subtag:
Full English name of subtag:
Intended meaning of the subtag:
If variant subtag, the intended prefix of subtag:
Native name of language (transcribed into ASCII):
Reference to published description of the language (book or article):
Any other relevant information:

[TBD: have section to explain the above form. E.g. Where a variant subtag is intended for use with a particular prefix, such as "boont" with the prefix "en-", this should be stated.]

The subtag registration form must be sent to for a two week review period before it can be submitted to IANA. (This is an open list. Requests to be added should be sent to .)

When the two week period has passed, the subtag reviewer, who is appointed by the IETF Applications Area Director, either forwards the request to IANA@IANA.ORG, or rejects it because of significant objections raised on the list. Note that the reviewer can raise objections on the list himself, if he so desires. The important thing is that the objection must be made publicly.

The applicant is free to modify a rejected application with additional information and submit it again; this restarts the two week comment period.

Decisions made by the reviewer may be appealed to the IESG [RFC 2028] under the same rules as other IETF decisions [RFC 2026].

All registered forms are available online in the directory http://www.iana.org/numbers.html under "languages".

Updates of registrations follow the same procedure as registrations. The subtag reviewer decides whether to allow a new registrant to update a registration made by someone else; normally objections by the original registrant would carry extra weight in such a decision.
Registrations are permanent and stable. When some registered tag should not be used any more, for instance because a corresponding ISO 639 code has been registered, the registration should be amended by adding a remark like "DEPRECATED: use <new code> instead" to the "other relevant information" section.

Note: The "published description" is intended as an aid to people trying to verify whether a language is registered, or what language a particular subtag refers to. In most cases, reference to an authoritative grammar or dictionary of that language will be useful; in cases where no such work exists, other well known works describing that language or in that language may be appropriate. The subtag reviewer decides what constitutes "good enough" reference material.

4. Security Considerations

The only security issue that has been raised with language tags since the publication of RFC 1766, which stated that "Security issues are believed to be irrelevant to this memo", is a concern with language ranges used in content negotiation - that they may be used to infer the nationality of the sender, and thus identify potential targets for surveillance.

This is a special case of the general problem that anything you send is visible to the receiving party; it is useful to be aware that such concerns can exist in some cases.

The evaluation of the exact magnitude of the threat, and any possible countermeasures, is left to each application protocol.

5. Character set considerations

Language tags may always be presented using the characters A-Z, a-z, 0-9, EQUALS SIGN, and HYPHEN-MINUS, which are present in most character sets, so presentation of language tags should not have any character set issues.
The issue of deciding upon the rendering of a character set based on the language tag is not addressed in this memo; however, it is thought impossible to make such a decision correctly for all cases unless means of switching language in the middle of a text are defined (for example, a rendering engine that decides font based on Japanese or Chinese language may produce sub-optimal output when a mixed Japanese-Chinese text is encountered).

6. Acknowledgements

This is the first draft of this document. Any list of contributors is bound to be incomplete; please regard the following as only a selection from the group of people who have contributed to make this document what it is today.

The contributors to RFC 3066 and RFC 1766, the precursors of this document, made enormous contributions directly or indirectly to this document and are generally responsible for the success of language tags.

The following acknowledgements made in RFC 3066 apply to this document as well:

In alphabetical order:

Glenn Adams, Tim Berners-Lee, Marc Blanchet, Nathaniel Borenstein, Eric Brunner, Sean M. Burke, John Clews, Jim Conklin, Peter Constable, John Cowan, Mark Crispin, Dave Crocker, Mark Davis, Martin Duerst, Michael Everson, Ned Freed, Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Paul Hoffman, Olle Jarnefors, Kent Karlsson, John Klensin, Alain LaBonte, Chris Newman, Keith Moore, Masataka Ohta, Keld Jorn Simonsen, Otto Stolz, Rhys Weatherley, Misha Wolf, Francois Yergeau and many, many others.

Special thanks must go to Michael Everson, who has served as language tag reviewer for almost the complete period since the publication of RFC 1766.

7. Authors' Addresses

   Addison P. Phillips
   webMethods, Inc.
   432 Lakeside Drive, Sunnyvale, CA, 94088, USA
   Phone: +1 408 962-5487
   EMail: aphillips@webmethods.com

   Mark Davis
   IBM
   Email: mark.davis@us.ibm.com

8.
References

[ISO 639] ISO 639:1988 (E/F) - Code for the representation of names of languages - The International Organization for Standardization, 1st edition, 1988-04-01. Prepared by ISO/TC 37 - Terminology (principles and coordination). Note that a new version (ISO 639-1: 2000) is in preparation at the time of this writing.

[ISO 639-2] ISO 639-2:1998 - Codes for the representation of names of languages -- Part 2: Alpha-3 code - edition 1, 1998-11-01, 66 pages, prepared by a Joint Working Group of ISO TC46/SC4 and ISO TC37/SC2.

[ISO 3166] ISO 3166:1988 (E/F) - Codes for the representation of names of countries - The International Organization for Standardization, 3rd edition, 1988-08-15.

[ISO 15924] ISO 15924:2003 (E/F) - Codes for the representation of names of scripts - The International Organization for Standardization, 2003-03-04, prepared by ISO TC46/WG3 and Michael Everson.

[RFC 1327] Kille, S., "Mapping between X.400 (1988) / ISO 10021 and RFC 822", RFC 1327, May 1992.

[RFC 1521] Borenstein, N., and N. Freed, "MIME Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies", RFC 1521, September 1993.

[RFC 2026] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.

[RFC 2028] Hovey, R. and S. Bradner, "The Organizations Involved in the IETF Standards Process", BCP 11, RFC 2028, October 1996.

[RFC 2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC 2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997.

[RFC 2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[RFC 2860] Carpenter, B., Baker, F. and M.
Roberts, "Memorandum of Understanding Concerning the Technical Work of the Internet Assigned Numbers Authority", RFC 2860, June 2000.

Appendix A: Language Tag Reference Material

The Library of Congress, maintainers of ISO 639-2, has made the list of languages registered available on the Internet. At the time of this writing, it can be found at http://www.loc.gov/standards/iso639-2/langhome.html

The IANA registration forms for registered language codes can be found at http://www.iana.org/numbers.html under "languages".

The ISO 3166 Maintenance Agency has published Web pages at http://www.din.de/gremien/nas/nabd/iso3166ma/

Appendix B: Examples of Language Codes (Informative)

Simple language code:
   de (German)
   fr (French)
   ja (Japanese)

Language code plus Script code:
   zh-Hant (Traditional Chinese)
   en-Latn (English written in Latin script)
   sr-Cyrl (Serbian written with Cyrillic script)

Language-Script-Region:
   zh-Hans-CN (Simplified Chinese for the PRC)
   sr-Latn-CS2003 (Serbian, Latin script, Serbia and Montenegro)

Language-Script-Region-Variant:
   en-Latn-US-boont (Boontling dialect of English)

Other Mixtures:
   zh-CN (Chinese for the PRC)
   en-boont (Boontling dialect of English)

Extension mechanism:
   de-CH-x-collation=phonebook
   az-Arab-x-SIL=AZE-dialect=derbend

EXPIRATION: This document expires in November 2004.
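The prefix matching rule of section 2.5 can be tried against the Appendix B tags with a short Python sketch. The function name range_matches is hypothetical and not part of this document. The sketch covers only the basic rule (exact match, prefix followed by "-", or the range "*") together with the case insensitivity of section 2.1; protocol-specific semantics for "*", such as those of HTTP/1.1, are not modeled.

```python
def range_matches(language_range, tag):
    """Return True if language_range matches tag under the prefix rule.

    Hypothetical helper illustrating section 2.5; comparison is case
    insensitive, as all tags are.
    """
    if language_range == "*":
        return True                      # the special range matches any tag
    rng, candidate = language_range.lower(), tag.lower()
    # Either an exact match, or a prefix whose next character is "-".
    return candidate == rng or candidate.startswith(rng + "-")


for rng, tag in [("zh", "zh-Hans-CN"), ("zh-Hans", "zh-Hant"),
                 ("en-Latn", "en-latn-US-boont"), ("de", "den")]:
    print(rng, tag, range_matches(rng, tag))
```

Requiring "-" after the prefix is what keeps a range from matching on a partial subtag: "zh-Hans" does not match "zh-Hant", even though the two tags share their first six characters.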