INTERNET-DRAFT                                          M.T. Carrasco Benitez
<draft-benitez-winter-cultures-00.txt>                          May 16th, 1996
Expires: November 16th, 1996

                                    WInter
                   (Web Internationalization & Multilinguism)
Status of this Memo
This document is an Internet-Draft.
Internet-Drafts are working documents of the
Internet Engineering Task Force (IETF),
its areas, and its working groups.
Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any time.
It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress".
To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt
"
listing contained in the Internet-Drafts Shadow Directories on
ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).
Distribution of this document is unlimited.
Please send comments to the WInter mailing list at
<winter@dorado.crpht.lu>.
Information about the WInter mailing list, including subscription details, is in
the WInter Page at:
http://www.crpht.lu/~carrasco/winter
This document discusses the Internationalization & Multilinguism of the Web:
a Web capable of supporting different cultures, natural languages and
Language Engineering facilities such as Parallel Texts.
Internationalization permeates most subsystems:
client, transmission, server, data and authoring;
the primitive mechanism for WIntering
should be part of the Web foundations.
1. Introduction
1.1 Mandate
1.2 Writing style
1.3 Terminology
2. Character Set
2.1 Back office
2.2 Front office
2.3 Multilingual typography
2.4 The characters in the URL
3. Internationalization & localization
3.1 Elements of localization
3.2 Messages as HTML pages
4. Multilinguism
5. Parallel Hypertext
5.1 Definition
5.2 Language tags
5.3 Document request
5.4 Parallel Hypertext Data Structure (PHDS)
5.5 Linking strategy
5.6 Generation of parallel texts
5.6.1 Language dependent strings
5.6.2 Language-void document
6. Bidirectionality (BIDI)
7. The LANG attribute
8. LINKs
9. Multilingual thesaurus
10. Electronic Data Interchange (EDI)
11. Passing selected text to a CGI
12. Reference model for Internationalization & Multilinguism
13. VRML
14. Java
15. Dragoman
15.1 Interactive Search
15.2 The Translation Folder (full preprocessing)
15.3 Preprocessing for Machine Translation
15.4 Machine Translation
15.5 Pseudo-Automatic Translation (PAT)
15.6 Document Generation
15.7 Document Comparison
15.8 Author's Workbench
15.9 Terminology Verification
15.10 Multilingual Aligned Text Editor
15.11 Printing
16. Acknowledgments
17. Bibliography
18. Author Address
The intention of this document is to consider all aspects of WIntering.
It aims to fulfill two functions:
- A catalogue of issues
- A primer
To a very large extent, it puts together the efforts of other groups.
It goes into more detail where material is not covered elsewhere.
An Internationalized & Multilingual Web
should have the traditional facilities of Internationalization
and more advanced facilities needed for Language Engineering.
For example, clients should have a language menu
(similar to edit or file menus)
that shows in which other linguistic versions
the currently displayed document is available;
or clients should be capable of displaying two linguistic versions of the
same document side by side and moving them in sync.
Another noteworthy characteristic of this manual is that
it doesn't always tell the truth.
When certain concepts of TeX are introduced informally,
general rules will be stated; afterwards you will find that
the rules aren't strictly true.
The TeXbook, Donald E. Knuth
The above quote particularly applies to the documents summarized in this document.
Though the intention is to make this document self-contained
by summarizing or quoting other documents,
it is strongly recommended to consult the source documents.
One of the recommendations of the
Internationalization Workshop during the
Fifth International WWW Conference
in Paris on May 6th 1996,
was that a document should be maintained to
fulfill the purpose described in the above introduction.
The author accepted the task and
the present document is the result.
A special effort should be made to make this document as
accessible as possible to non-computer specialists
(e.g., linguists) and non-English native speakers.
Due to the characteristics of WInter,
there should be a significant number of both.
This does not imply that there should be one type of document
for each type of participant.
It means that this document should be accessible to all participants.
Perhaps by adopting a journalistic style and restating the obvious.
The overhead should be small and it is good to avoid misunderstanding,
even between people of the same field.
Comments regarding the writing style from journalists
or readers with similar profiles are very welcome;
i.e., non-computer specialists who
have to explain computer materials to other non-computer specialists.
Some of the suggestions could be what additional material should
be included to make this document more self-contained,
and what terms should be replaced to make it more accessible.
But, the gory normative details must be present.
- Alignedness:
It is a quality of Parallel Texts;
for example, the Treaty of Rome in English and Spanish are Parallel Texts and
they should be aligned.
The interesting part is aligning Parallel Texts automatically.
- Author-Translator-Publisher Chain (ATP-chain):
It refers to the integration of all the phases in the production of documents,
usually in large distributed systems.
- Globalization:
In the context of electronic commerce,
the mechanisms to facilitate global trade.
Internationalization & Multilinguism are some of these mechanisms.
A legal framework is an example of a non-computer mechanism.
- I18N:
Abbreviation for Internationalization;
the 18 refers to the 18 characters "nternationalizatio"
between the initial "I" and the final "N".
- Language Engineering:
Language Engineering is the application of computer science
to natural languages.
For example:
- Terminology
- Translator's Memory
- Multilingual documentary databases
- Aligned Text
- Translator's Workbench
- Author's Workbench
- Machine Translation
- Publishing (in particular, multilingual synchronized publishing)
- Level of Alignedness:
This is a metric of alignedness.
Depending on the depth to which it is possible to identify the Linguistic Objects,
the texts are aligned at:
- Document level: the trivial case; i.e., Parallel Texts.
- Paragraph level: not too hard to achieve.
- Sentence level: desirable and possible to achieve.
- Term level: it needs tagging for automatic alignedness.
- Word level: it needs tagging for automatic alignedness.
In this context, sentence
is a part of a text delimited by a dot, semicolon or similar;
i.e., it has little grammatical meaning and
the main interest is to identify Linguistic Objects.
- Linguistic Object:
Linguistic Object is a unit of language representation.
It can be a fixed language representation
(term, abbreviation, title, segment, phrase, paragraph, etc)
or meta-language representation (a grammatical construction, etc).
More generally, a Linguistic Object is a discrete linguistic unit
(usually a string) whose meaning is created by the program treating it.
- Multilingual Aligned Text (MAT):
A MAT is a record in a table with one Linguistic Object per language field
(English, Spanish, German, etc);
the Linguistic Objects are equivalents (usually translations) of each other.
There are other fields for classification and other purposes.
MATs constitute independent elements of a table;
i.e., there is no ordering in the table.
The end result is a data structure similar to a multilingual dictionary
(a sketch of such a record is given at the end of this terminology list).
- Parallel Texts:
Texts that are translations of each other.
For example, the Treaty of Rome in English and Spanish are Parallel Texts.
Parallel Texts could be aligned to several levels.
- WInter:
It stands for Web Internationalization & Multilinguism.
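The following Python fragment is the sketch of a MAT record referred to under
the Multilingual Aligned Text entry above; the field names and values are
illustrative assumptions, not part of any standard.

# A minimal sketch of a MAT record; field names are illustrative assumptions.
mat_record = {
    "id": 4711,                      # record identifier
    "classification": "furniture",   # non-linguistic field for classification
    "status": "verified",            # e.g. unverified, verified, compulsory
    "en": "The black table",         # one Linguistic Object per language field
    "es": "La mesa negra",
    "de": "Der schwarze Tisch",
}
# Records are independent; a MAT table is simply an unordered collection.
mat_table = [mat_record]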
A large character set is a basic prerequisite for having
Internationalization & Multilinguism.
The bottom line is that the Web must be capable of handling
Unicode [UNICODE].
The character set should be considered a low-level layer;
i.e., like the wires in the seven-layer ISO Reference Model
(physical, data link, network, etc).
Other functionalities should be in other layers.
There is a tendency to overload this layer,
as opposed to defining new layers.
There are two aspects to the character set:
- The Back office:
It deals with storage on disk, transmission, representation in the document, etc.
- The Front office:
It is concerned with rendering on the screen or printer.
Latin-1 [ISO-8859-1] is the default character set for the Web.
Latin-1 is only sufficient for Western European languages.
Latin-1 is an 8-bit encoding.
This permits a maximum of 256 characters.
Unicode (ISO 10646 BMP)
is a large character set that includes most of the world languages.
Unicode is a 16-bit encoding.
This permits over 65,000 characters.
At present, over 25,000 positions are still free.
This form is also called UCS-2; i.e., Universal Character Set 2-bytes.
Unicode is the first plane of ISO 10646 (see below);
this plane is also called
BMP (Basic Multilingual Plane) or
Plane Zero.
The Internationalization of the Hypertext Markup Language
[I-HTML]
proposes Unicode as the document character set.
ISO 10646 is a 32-bit encoding.
It is divided into 32,000 planes, each with a capacity of 65,000 characters.
This permits 2,080 million characters.
This form is also called UCS-4; i.e., Universal Character Set 4-bytes.
Only the first plane (Unicode) is in use.
UTF-8 (Universal Character Set Transformation Format)
is an addendum to ISO 10646.
It provides compatibility with ASCII:
the ASCII characters are represented by 1 byte (8 bits)
and not 4 bytes (32 bits).
In general, it is economical with the bytes used in the encoding.
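As a rough illustration of this economy, the following Python fragment (a sketch,
not part of any specification) prints the number of bytes UTF-8 uses for an ASCII
character, an accented Latin-1 character and an ideograph.

# Sketch: byte counts produced by UTF-8, compared with fixed-width encodings.
for ch in ("A", "\u00e9", "\u6f22"):          # "A", "é", and an ideograph
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s) in UTF-8,",
          "2 bytes in UCS-2, 4 bytes in UCS-4")
# "A" -> 1 byte (identical to ASCII); "é" -> 2 bytes; the ideograph -> 3 bytes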
[HTTP-1.1] allows for the character set to be negotiated.
For example, the client and server can agree on using Unicode.
Rendering is drawing the glyphs
(graphic representation of the characters)
on the screen or printer.
This is the job of the browser and
the browser depends on the graphical facilities of the computer.
Undisplayable characters
are the characters that cannot be displayed due to the lack of facilities.
The I-HTML
"does not prescribe any specific behavior",
but notes some "considerations".
WInter recommends the following:
- The behavior of undisplayable characters must be controlled by
the option settings of the browser.
- Some options can be combined.
- There must be a small Undisplayable Characters Flag
in the browser part of the screen, not in the document part.
Something similar to the red button indicating that the browser
is loading a document, but smaller.
The flag must be ON if the current document contains one or more
undisplayable characters.
The presence or absence of the flag must be user definable.
- Undisplayable Character Tolerance is a user definable value
in the range from 0 to 10,
that signals the behavior of the browser.
- 0 Undisplayable Character Tolerance means ignore all
undisplayable characters.
- 5 Undisplayable Character Tolerance means a reasonable default warning
for undisplayable characters.
This behavior must be defined.
For example, show only up to 10 continuous undisplayable characters and
try remaps, such as "é" to "e".
- 10 Undisplayable Character Tolerance means show one
Replacement Glyph
for each undisplayable character.
- The other intermediate values must change gradually.
- Undefined Undisplayable Character Tolerance must gravitate towards
the default value (5).
- The undisplayable characters must be remappable to a user definable
Replacement Glyph;
for example, "_".
Or one of several numeric representations;
for example, hexadecimal or decimal.
- The default Replacement Glyph must occupy approximately
the same space as the average glyph in the document.
It must be a box containing the Unicode value in hex.
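A minimal sketch, in Python, of how a browser might apply these tolerance rules;
the function, the ASCII-only glyph set and the bracketed hexadecimal box are
illustrative assumptions rather than prescribed behavior.

# Sketch: a browser-side filter applying the Undisplayable Character Tolerance.
def replace_undisplayable(text, displayable, tolerance=5, glyph="_"):
    out = []
    for ch in text:
        if ch in displayable:
            out.append(ch)                    # the glyph is available
        elif tolerance == 0:
            continue                          # 0: ignore undisplayable characters
        elif tolerance >= 10:
            out.append("[%04X]" % ord(ch))    # 10: box with the Unicode value in hex
        else:
            out.append(glyph)                 # intermediate: Replacement Glyph
    return "".join(out)

ascii_glyphs = set(map(chr, range(32, 127)))
print(replace_undisplayable("Caf\u00e9 \u6f22", ascii_glyphs, tolerance=10))
# Caf[00E9] [6F22]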
Font Servers could supply the browser with missing glyphs.
{The proposition of Martin Dürst will be summarized here.}
The characters allowed in the URL are a subset of ASCII.
URLs were supposed to be hidden,
but they are very visible and important commercially:
firms want to spell their names with accents.
The most urgent need is a large character set for the query part.
There have been proposals to use UTF-8.
URLs need a lot of work.
Internationalized software is developed without
the cultural characteristics embedded.
It can be localized parametrically for different cultures;
for example, the same software can run for
Germany with the German conventions,
or for Italy with the Italian conventions.
Internationalization is a well known field;
for example, a significant amount of work was done
during the POSIX (Unix) standardization.
The mechanisms must be sufficient for implementing the localizations.
Localization itself is usually discussed in other fora;
for example, how to represent the date in Germany.
Most conventions have been already agreed.
Any number of cultures (real or imaginary) are possible.
For example, France, Germany, European Commission.
In the case of the European Commission,
it has to work in the eleven official languages (including Greek),
and with cross-cultural conventions or with the national conventions.
- Languages:
Two aspects:
- Language strings in the software.
- Data in the document.
Example, the software could be in German and
the document shown in French.
- Sorting order
- Number representation:
Example, the internal number could be 12345.67 and
the external representation could be 12,345.67 or 12.345,67.
- Date & Time:
Example, the internal representation could be 19951231 and
the external representation could be December 31st, 1995, or 31-12-1995.
- Short quotations:
Example,
- "I am a Berliner" (English)
- <<Je suis un Berlinois>> (French)
- ,,Ich bin ein Berliner'' (German)
The new element <Q> in I-HTML is for this purpose.
New internationalization elements should be added to this list,
for example, color.
The software should be localized from a list of preferred localizations, and
switchable from one localization to another without restarting the application.
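A minimal sketch, in Python, of such parametric localization for the number and
date conventions listed above; the table of conventions and the function names
are illustrative assumptions.

# Sketch: the same internal values rendered under different cultural conventions.
import datetime

CONVENTIONS = {
    "de":    {"thousands": ".", "decimal": ",", "date": "%d.%m.%Y"},
    "en-US": {"thousands": ",", "decimal": ".", "date": "%B %d, %Y"},
}

def localize_number(value, culture):
    c = CONVENTIONS[culture]
    whole, frac = ("%.2f" % value).split(".")
    groups = []
    while whole:
        groups.insert(0, whole[-3:])
        whole = whole[:-3]
    return c["thousands"].join(groups) + c["decimal"] + frac

def localize_date(yyyymmdd, culture):
    d = datetime.datetime.strptime(yyyymmdd, "%Y%m%d")
    return d.strftime(CONVENTIONS[culture]["date"])

print(localize_number(12345.67, "de"))       # 12.345,67
print(localize_date("19951231", "en-US"))    # December 31, 1995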
The Status-Code and the Reason-Phrase (see 6.1.1, HTTP-1.1)
are presented as HTML pages.
These are Language strings in the software but
are usually presented as data documents.
For example, 404: Not Found.
The localization of the Reason-Phrase can be done by the client or the server.
If the client can do a better job,
it has to drop the page sent by the server and
generate the localized page from the Status-Code and the LANG tag.
Multilinguism deals with advanced language facilities,
often several languages simultaneously.
It is also referred to as Language Engineering.
This comes from the tradition of specialized software for
Language Engineering, such as Translator's Workbench.
One of the main applications is the processing of
Parallel Texts.
Most of the software in Language Engineering is incompatible and
there are practically no standards in this field.
Usually, researchers or vendors start from scratch and
develop all the modules,
even horizontal modules such as user interfaces and data structures,
rather than concentrate on the engines for language processing
(for aiding the translator, machine translation, etc).
One of the main immediate objectives in Language Engineering
must be the creation of standards that clearly separate data and software;
i.e., it should be possible to acquire a translation aid program from one vendor and
the dictionaries from another vendor.
The purpose is not to make every browser a Translator's Workbench,
though browsers could do with more advanced language facilities than
are usually found in internationalized products.
But the standards must allow the construction of Translator's
Workbenches based on the Web technology.
After security and secure payment over the Internet,
Language Engineering is one of the most economically relevant applications;
in intranets, with fewer security requirements,
it is probably the most important.
It is as horizontal as publishing and,
indeed, it is the second phase in the ATP-chain (Author-Translator-Publisher).
Translating is expensive and very human intensive.
For most texts, machine translation is not acceptable.
On the other hand,
translation aiding tools are very cost effective,
particularly if integrated in an ATP-chain.
Savings in translation tend to be large.
Parallel Hypertext is an extension of the hypertext paradigm
to natural languages.
For example,
a user looking at a document in English should be able
to obtain the Spanish version in a transparent way;
i.e., just by selecting the Spanish option in a language menu
and not by selecting a link embedded in the English version.
For this, the Web must know about languages;
i.e., the notion of "the same document in another language".
The same property of alignedness in Parallel Texts applies to Parallel Hypertext.
The language tags (see 3.10, HTTP-1.1) are composed of a primary language tag
and one or more subtags that could be empty.
Examples:
en
en-US
en-cockney
There must be a way to indicate
- Human translation
- Machine translation
- Transliteration
This could be part of a subtag or inside the document.
{Examples will be added.}
Clients should be able to request documents at least in the following ways:
- A document is requested according to a language preference list that
could be the same list used for choosing the display labels in the user interface.
The server must respond with the best linguistic version and
the list of available linguistic versions.
The best linguistic version means the one nearest to the top of the list;
if none is available,
the one nearest to the top of the defaults in the server.
In this case,
the browser probably does not know which linguistic versions are available.
{This will be developed.}
- A document is requested in one specific language.
The server must respond only with that linguistic version
(no other is acceptable) and
the list of available linguistic versions.
In this case,
the client probably knows that the requested version is available;
it could be the result of a previous conversation with the server.
Example:
- Conversation 1
Client : Give me MyDoc with this order of preference: Danish, English or German
Server : Take MyDoc in German; it is available in German, Italian and Spanish
- Conversation 2
Client : Give me MyDoc only in Spanish
Server : Take MyDoc in Spanish;
it is available in German, Italian and Spanish
The linguistic versions of the document could be in different servers.
This could be done with the Accept-Language and Content-Language
facilities (see 10.4 and 10.11, HTTP-1.1).
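A minimal sketch, in Python, of the selection rule described above, with the two
example conversations repeated as comments; the function name is an illustrative
assumption, and the use of Content-Language to announce all available versions is
the extension proposed in this document, not standard HTTP/1.1 behavior.

# Sketch: server-side choice of the best linguistic version.
def best_version(preferences, available, server_defaults):
    for lang in preferences:        # the one nearest to the top of the user's list
        if lang in available:
            return lang
    for lang in server_defaults:    # otherwise, nearest to the top of the defaults
        if lang in available:
            return lang
    return None

available = ["de", "it", "es"]
# Conversation 1:  Accept-Language: da, en, de   ->  Content-Language: de
print(best_version(["da", "en", "de"], available, ["en"]))    # de
# Conversation 2:  Accept-Language: es           ->  Content-Language: es
print(best_version(["es"], available, ["en"]))                # es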
The parameter in Accept-Language,
the quality factor "q",
is described as
"... estimate of the user's comprehension of that language ...".
But the user indicates his language preference list and
there is no need to use the parameter with this meaning.
It would be more useful to indicate the
"minimum acceptable quality of the translation".
Some of the translations could be done by more or less experienced
translators, or by machine translation.
A different usage could be to indicate the level of alignedness.
Maximum acceptable size "mxb" is not used.
It could indicate the number of linguistic versions desired.
An Accept-Language with a single language parameter must mean that
the browser only wants that linguistic version and not another.
The Content-Language
"... describes the natural language(s) of the intended audience ...".
The meaning of this field should be
"the list of linguistic versions available";
it should be used by the browser to update the language menu,
so the user could know which other linguistic versions are available.
One Parallel Hypertext Data Structure
contains all the information for one Parallel Hypertext Document.
The Parallel Hypertext Data Structure must allow the following:
- Several data schemes. For example, directory, SGML, tar, etc
- Keeping the linguistic versions in different servers
- Conversation with monolingual clients.
In this case, the user must know the structure
The Parallel Hypertext Data Structure has two parts:
- The PHDS-Header:
Contains administrative data;
for example, where the German linguistic version is located.
The data is divided into structured fields.
- The PHDS-Body:
Contains the linguistic data.
It has one section per language.
The PHDS-Header is always an HTML file.
This file must fulfill two functions:
- Allowing a user to select one linguistic version
- Be used by WIntered Web programs (clients/servers) as a data structure
to locate the pertinent linguistic version
The PHDS-Header must contain at least the following information:
- Name
- DataScheme
- DataLocation (for all the parts)
The DataScheme applies only to the PHDS-Body;
the PHDS-Header is always in HTML.
{An example of a file in HTML will be added.}
The default for a single set of files is:
DocName.html (PHDS-Header)
DocNameDir (PHDS-Body, a directory)
/en.html English (PHDS-Body language section)
/es.html Spanish (PHDS-Body language section)
/de.html German (PHDS-Body language section)
The default for several sets of files is:
DocName.html (PHDS-Header)
DocNameDir (PHDS-Body, a directory)
/en/DocName1.html English (PHDS-Body language section)
/en/DocName2.html English (PHDS-Body language section)
/es/DocName1.html Spanish (PHDS-Body language section)
/es/DocName2.html Spanish (PHDS-Body language section)
/de/DocName1.html German (PHDS-Body language section)
/de/DocName2.html German (PHDS-Body language section)
The DocName.html should be usable directly by present clients
(browsers) and/or indirectly to generate HTML files on the fly.
Multilingual clients should use the information to access the
documents in a transparent way.
Requesting a URL of a PHDS-Header must get the linguistic version
according to the rules of the language preferences.
Requesting a URL of a PHDS-Body language section must get that linguistic
version.
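A minimal sketch, in Python, of how a WIntered server might map a PHDS-Header
name and a chosen language onto the PHDS-Body language section, following the
default single-set layout above; the function name is an illustrative assumption.

# Sketch: mapping a PHDS-Header name plus a chosen language to the
# PHDS-Body language section, using the default single-set layout.
import os

def phds_body_path(header_path, language):
    # .../DocName.html  ->  .../DocNameDir/<language>.html
    base, _ = os.path.splitext(header_path)
    return "%sDir/%s.html" % (base, language)

print(phds_body_path("/docs/MyDoc.html", "es"))   # /docs/MyDocDir/es.html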
The server must know at least the following defaults:
- language with the explicit links
- preferred language list
- MAT table
{This will be extended.}
A standard data structure for Parallel Hypertext
would be of use for anybody working with Parallel Texts,
independently of whether the Web is used or not.
For example, CD-ROMs could be published with Parallel Texts
for language processing programs,
such as Machine Translation,
that would know what to expect.
At present, there is no standard for Parallel Texts or MAT.
The relation with Text Encoding Initiative (TEI) will be explored.
The linking strategy must minimize the maintenance.
This is essential for large multilingual documentary databases.
For example, the millions of pages of the European Institutions
in eleven languages.
Only one linguistic version should have explicit links;
i.e., the links as used today that are physically present in the documents.
The other linguistic versions would have implicit links;
i.e. links that would not be physically present in the texts,
but they could be calculated by the alignedness
of the different linguistic versions.
The generation of implicit links could be a client,
server and/or authoring affair:
- Client.-
A client could receive a linguistic version with explicit links
and a linguistic version with implicit links.
The client would display the linguistic version with the explicit links
or it would calculate the implicit links on the fly
and display the result.
- Server.-
A multilingual server could process documents with implicit links and
generate documents with explicit links on the fly.
- Authoring.-
An interactive or batch authoring system could process documents with
implicit links and
it could create new documents with explicit links;
the server would not know how the new documents were created.
These options should be considered as a continuum and
(some) are not mutually exclusive:
most degrees between the extremes are possible.
For example, servers could be able to create documents on the fly
and they could be using documents with the links generated
by authoring systems.
Indeed, a mixture could be the most probable case.
The level of alignedness should be calculated in advance and kept
in the Parallel Hypertext Data Structure.
Some documents are widely regarded as aligned because
they were revised over half a dozen times and
have been heavily used for decades (best-case documents);
yet once submitted to a computer program,
it came to light that they were not aligned even at paragraph level.
The linked text
(i.e., what goes between <a ...> and </a>)
would have to be at least at the level to which the texts are aligned.
For example, for texts aligned only at paragraph level,
it is not possible to calculate implicit links at sentence level.
A corollary is that texts aligned at document level can have implicit
links only at the beginning or at the end.
The links would have to be at least at sentence level.
It would be hard to place implicit links in part of a sentence without
tagging:
the second text should have null links
(named null links if there
are several in one sentence).
Examples:
- No need for null links in the second text.
A whole sentence is linked in the first text and finding the place
for the implicit links in the second text is easy.
The white table. <a href="MyURL"> The black table </a> The green table.
La mesa blanca. La mesa negra. La mesa verde.
(implicit link)
- It needs a null link in the second text.
Only part of a sentence is linked in the first text and
finding the place for the implicit link in the second text is hard;
i.e., it cannot be done with simple string processing and
it needs computational linguistics.
The white table. The black <a href="MyURL"> table </a> The green table.
La mesa blanca. La <a name="Null"> mesa </a> negra. La mesa verde.
(null link)
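A minimal sketch, in Python, of placing implicit links for texts aligned at
sentence level; the naive sentence splitter and the handling of the HTML are
illustrative assumptions, and the null-link case above is deliberately not covered.

# Sketch: copy an explicit link from sentence i of the first text onto
# sentence i of the second text (texts aligned at sentence level).
import re

def split_sentences(text):
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

def implicit_links(source, target):
    out = []
    for src, tgt in zip(split_sentences(source), split_sentences(target)):
        match = re.search(r'href="([^"]+)"', src)
        if match:                       # the sentence carries an explicit link
            out.append('<a href="%s">%s</a>' % (match.group(1), tgt))
        else:
            out.append(tgt)
    return " ".join(out)

en = 'The white table. <a href="MyURL">The black table</a>. The green table.'
es = "La mesa blanca. La mesa negra. La mesa verde."
print(implicit_links(en, es))
# La mesa blanca. <a href="MyURL">La mesa negra.</a> La mesa verde.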
The linguistic versions could be generated through machine translation
or other techniques.
For example, a system could have documents in Spanish and
a program for translation to English.
The user should be informed, via the language menu, in which
languages and with which techniques (MT, human translator, etc)
the documents are available.
{This will be extended.}
These are tags to be replaced by a language string (Linguistic Object)
according to the language requested.
For example,
the following shows the content of an HTML document and
the resulting replacement; assuming that the language requested is German and
that the Linguistic Object corresponding to the identifier String_1 is
the German phrase below:
<SomeTag SomeLabel=String_1>
Ich bin ein Berliner
A document without any language string;
i.e., it contains only language dependent strings.
In this case, only one HTML document is needed and not one per language;
this HTML document could be considered a mask.
A database with Linguistic Objects is needed.
The same Linguistic Object can be used in several documents.
This technique could be used for the localization of the messages sent by
the server as HTML documents.
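A minimal sketch, in Python, of filling a language-void mask from a table of
Linguistic Objects; the tag syntax follows the example above, while the table
contents and function names are illustrative assumptions.

# Sketch: replace language dependent string identifiers in a language-void
# mask with the Linguistic Objects of the requested language.
import re

LINGUISTIC_OBJECTS = {
    "String_1": {"de": "Ich bin ein Berliner", "en": "I am a Berliner"},
}

def fill_mask(mask, language):
    def lookup(match):
        return LINGUISTIC_OBJECTS[match.group(1)][language]
    return re.sub(r"<SomeTag SomeLabel=(\w+)>", lookup, mask)

print(fill_mask("<SomeTag SomeLabel=String_1>", "de"))   # Ich bin ein Berliner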
(see 4.2, I-HTML)
{A summary of the I-HTML will be inserted.}
(see 3, I-HTML)
{A summary of the I-HTML will be inserted.}
<LINK REL=Glossary>
<LINK REL=Dictionary>
<LINK REL=Translation>
{This will be extended.}
This is a tool for finding references to the search term in any language.
For example, if the string in the search is "table" it should also
find the Spanish document with the word "mesa" (table in Spanish).
Many EDI messages are printed.
As the EDI messages are very structured,
a translation of the message could be shown using
Pseudo-Automatic Translation (PAT).
To consult terminological databases easily,
it should be possible to pass a selected string
(with the mouse or otherwise) to CGI programs or similar.
This is a generic mechanism.
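A minimal sketch, in Python, of handing a selected string to a CGI program as a
query parameter; the host, program name and parameters are illustrative assumptions.

# Sketch: build a URL that passes a selected string to a CGI program.
from urllib.parse import quote

def terminology_url(selected_text, language):
    return ("http://example.org/cgi-bin/termbase?lang=%s&term=%s"
            % (language, quote(selected_text)))

print(terminology_url("mesa", "es"))
# http://example.org/cgi-bin/termbase?lang=es&term=mesa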
This is a first attempt and further work is needed.
The model is layered,
similar to the seven-layer ISO Reference Model
(physical, datalink, network, etc).
A different approach could be needed;
for example, a vector approach.
LayerNumber  LayerName        Example
1            compression      gzip
2            transformation   UTF-8
3            character set    Unicode (65, "LATIN CAPITAL LETTER A")
4            glyph            "A"
5            font             Times
Other items to put into the model:
- sorting order
- language (e.g., Korean)
There is a general tendency to overload the character set layer.
For example, wishing to allocate two code positions to the same ideogram
because it means different things in different languages.
How do objects negotiate when they speak different languages?
{This will be developed.}
{This will be developed.}
This section is included mostly to illustrate the kind of applications
for multilinguism.
Dragoman is a reference model for Language Engineering.
It uses the Multilingual Aligned Hypertext technique.
In essence, Dragoman describes a Database
(part structured and part documental) and
Services that can be implemented over the (multilingual) Database.
Often, different data structures are used for the Services described
below.
The Web paradigm is particularly well adapted to Dragoman.
The term Dragoman has nothing to do with dragons;
it means language interpreter.
What follows is a very brief description of some of the Services that
could be implemented over the Database.
There could be several programs offering the same Service.
Services processing whole documents could be implemented in batch;
particularly if they are using a very large Database (several gigabytes).
Selects the Multilingual Aligned Texts (MAT) that match a search criterion.
The search is fuzzy (e.g. an 87% match).
Unfound requests are valuable information that must be processed further.
The system must keep track of the unfound requests
to put in contact people with similar needs (matchmaker);
the user must decide what is a typing error and
what is a genuine unfound request.
Also the user can send messages to terminologists (demand driven terminology).
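A minimal sketch, in Python, of the fuzzy selection; the similarity ratio,
threshold and field names are illustrative assumptions rather than a prescribed
matching algorithm.

# Sketch: fuzzy selection of Multilingual Aligned Texts (e.g. an 87% match).
from difflib import SequenceMatcher

def search_mat(query, table, field="en", threshold=0.87):
    hits = []
    for record in table:
        ratio = SequenceMatcher(None, query.lower(), record[field].lower()).ratio()
        if ratio >= threshold:
            hits.append((ratio, record))
    hits.sort(key=lambda hit: hit[0], reverse=True)      # best matches first
    return hits

mat_table = [{"en": "The black table", "es": "La mesa negra"},
             {"en": "The green table", "es": "La mesa verde"}]
print(search_mat("the black tables", mat_table))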
The objective is to obtain a complete Translation Folder for a given document.
Hence, the translator should not need to consult dictionaries, databases,
glossaries, nomenclature lists, etc.
It is like having a hundred assistants preparing the text for the translator.
In a typical Translation Folder,
some paragraphs should be fully translated and
some paragraphs should be a mixture of
full sentences, segments, titles, terms, nomenclatures, etc
(all these items are packaged as Linguistic Objects);
background documents could also be taken into account.
The Linguistic Objects are marked with the Status;
for example, unverified, verified, compulsory, etc.
The search follows a fuzzy biggest chunk heuristic.
Traditionally there are two texts, source and target.
But there could be any number of language fields.
This could be the most useful Service for the translator and
it should be implemented early.
The translator could use the result on paper or on the screen.
Similar to the Translation Folder.
It should be adapted to an (existing) machine translation program that
follows up the processing.
For example, select only exact matches (no fuzzy) and
terms in the unfound phrases;
the machine translation program would translate only the unfound phrases.
A Machine Translation program that uses the Database directly.
For example, a program could combine perfect matches,
process the easy fuzzy matches such as dates, pure Machine Translation, etc.
Similar to the Translation Folder,
but where all the texts are found with a 100% match (no fuzzy search).
The program should be restricted to a collection of records;
i.e., it should not be allowed to roam the Database as
there could be bad surprises.
In particular, one must avoid word by word translation;
hence one must be very careful with small Multilingual Aligned Texts
(for example, a one-word Multilingual Aligned Text).
All the linguistic versions of a document are generated camera-ready.
There is no source and translation as such; the index is created and
the typesetting is (nearly) done.
This is the most useful Service for the Organization.
It is a very efficient way to produce documents.
The three phases Author-Translator-Publisher (ATP-chain) are highly integrated.
It is particularly adapted to periodic publications.
The production of standardized documents is trivial.
Documents in several linguistic versions are often required to be synchronized;
i.e., each page in each linguistic version must contain the same content and
the same layout (text, number of paragraphs, etc).
The typesetting, including the synchronization,
must be automated, and pages should not be processed one by one by a human;
a human operator should intervene only to fine-tune the publication.
TeX should be considered.
A document might need several representations;
for example, typeset for the Official Journal,
formatted for a CD-ROM
or marked in HTML (for CD-ROM or server).
First, a document in SGML should be generated;
indeed, the SGML document is the document.
All the following representations should be created from the SGML document.
This method should guarantee that all the representations have the same content.
With such a system in place,
the creation of secondary products is easy.
For example, a Parliamentary Commission could work with
a draft of the Budget typeset like the Official Journal,
in all the linguistic versions, enriched with hidden comments.
The user directs the program to a document similar to the one that
has to be translated.
The new pieces could be fetched from the Database.
This program could work without the Database,
though then the new pieces would not be fetched.
Similar translations could arise from a new version of a previous document
or from a new but similar document.
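A minimal sketch, in Python, of identifying the new pieces of a document relative
to a similar, already translated document; the use of a general-purpose difference
algorithm is an illustrative stand-in for the comparison engine.

# Sketch: mark the pieces of a new document that differ from a similar,
# already translated document; only these pieces need fresh attention.
from difflib import SequenceMatcher

def new_pieces(old_sentences, new_sentences):
    pieces = []
    matcher = SequenceMatcher(None, old_sentences, new_sentences)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "insert"):
            pieces.extend(new_sentences[j1:j2])
    return pieces

old = ["The white table.", "The black table.", "The green table."]
new = ["The white table.", "The blue table.", "The green table."]
print(new_pieces(old, new))     # ['The blue table.']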
Authors could use a technique similar to the Translation Folder and
Document Comparison.
The unknown parts of the text would be marked and
in certain cases alternatives would be proposed.
Texts created with the translation phase in mind are easier to translate.
Ideally, the author should aim to produce a text for translation with
Pseudo-Automatic Translation.
The objective is to verify the Consistency and Harmonization of the terminology.
The concepts are closely related and they can be combined,
but they are not the same.
- Consistency is naming the same object with the same term.
It is an internal characteristic of a set of documents
(the unitary set is allowed) and
it does not need a Database.
The more linguistic versions of the set of documents the better.
- Harmonization is imposing a term by the Terminological Authority.
It is an external characteristic of the document and
it needs a Database with the harmonized terms.
An editor that shows at least two (aligned) texts,
moves the texts in sync, highlights the differences, etc.
A program that prints one or several Multilingual Aligned Texts side by side.
It could be the following step after the Translation Folder.
Multilingual Aligned Texts (source and target) on paper allow the translator
to use traditional tools such as dictating.
This document makes heavy use of the documents cited in the text,
particularly the relevant RFCs and Internet-Drafts.
Also from the following:
- Web Multilinguism. BOF meeting, Third International WWW Conference
- Web Internationalization. BOF meeting, Fourth International WWW Conference
- Web Internationalization & Multilinguism. BOF meeting, Fifth International WWW Conference
- Internationalization Workshop. Fifth International WWW Conference
- WInter mailing list
- Informal talks/communications (probably the most fruitful)
The BOF meetings were organized by the author.
Martin Dürst made many suggestions to the position paper
of the author for the
Internationalization Workshop during the Fifth International WWW Conference.
The present document is over 80% based on the position paper.
He commented on the Reference Model and
I expect him to come back with further suggestions.
In such fluid circumstances, it is nearly impossible to attribute credit.
The following particularly come to mind:
Bert Bos
Martin Bryan
Martin Dürst
Albert Lunde
Larry Masinter
Gavin Nicol
Steven Pemberton
Christine Stark
François Yergeau
Faith Zack
The author tried to look for consensus and borrowed
heavily from many sources.
On the other hand, he is solely responsible
for any shortcomings and for the opinions expressed.
[BRYAN]
Martin Bryan,
"Using HyTime to Link Translations",
contribution to the WInter mailing list,
http://www.crpht.lu/~carrasco/winter/hytime.html
[CARRASCO-1]
M.T. Carrasco Benitez,
"On the multilingual normalization of the Web",
Poster for the Third International WWW Conference,
http://www.crpht.lu/~carrasco/winter/poster.html
[CARRASCO-2]
M.T. Carrasco Benitez,
"Web Internationalization",
Poster for the Fourth International WWW Conference,
http://www.crpht.lu/~carrasco/winter/inter.html
[CARRASCO-3]
M.T. Carrasco Benitez,
"WInter (Web Internationalization & Multilinguism0",
Position paper for the Internationalization Workshop during the
Fifth International WWW Conference,
http://www.crpht.lu/~carrasco/winter/popa.html
[CONNOLLY]
"Character Set Considered Harmful",
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html
[HTML 2.0]
T. Berners-Lee,
D. Connolly,
"HTML 2.0",
RFC 1866,
http://www.ics.uci.edu/pub/ietf/html/rfc1866.txt
[HTML 3.0]
"HTML 3.0",
expired Internet-Draft,
http://www.hpl.hp.co.uk/people/dsr/html3/CoverPage.html
[HTTP-1.1]
R.T. Fielding,
H. Frystyk Nielsen, and
T. Berners-Lee,
"Hypertext Transfer Protocol -- HTTP/1.1",
Work in progress
(draft-ietf-http-v11-spec-01.txt)
MIT/LCS, January 1996.
http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-01.html
[I-HTML]
F. Yergeau,
G. Nicol,
G. Adams,
M. Dürst,
"Internationalization of the Hypertext Markup Language",
Work in progress,
(draft-ietf-html-i18n-03.txt)
http://www.alis.com:8085/ietf/html/draft-ietf-html-i18n.txt
[ISO-8859-1]
ISO 8859-1:1987.
International Standard --
Information Processing --
8-bit Single-Byte Coded Graphic Character Sets --
Part 1: Latin Alphabet No. 1.
[NICOL]
G. T. Nicol,
"The Multilingual WWW"
http://www.ebt.com:8080/docs/multilingual-www.html
[UNICODE]
The Unicode Consortium,
"The Unicode Standard -- Worldwide Character Encoding -- Version 1.0",
Addison-Wesley, Volume 1, 1991, Volume 2, 1992.
http://www.unicode.org
[ZACK]
F. Zack,
"Serving Multilingual Online Documentation",
Poster for the Fifth International WWW Conference
{This list will be completed.}
Manuel Tomas CARRASCO BENITEZ
carrasco@innet.lu
http://www.crpht.lu/~carrasco/winter