Third International World-Wide Web Conference
10-14 April 1995, Darmstadt, Germany

On the multilingual normalization of the Web

M.T. Carrasco Benitez
Commission of the European Communities

Summary

This document discusses multilingual aspects of the Web. Extensions should be as compatible as possible with the present Web and they should be comprehensive; in particular, they should cover Language Engineering. The two main areas are: 1) Character set; 2) Multilingual Aligned Hypertext.

I will try to organize a meeting on this subject during the Third WWW Conference. If I can find facilities, I will post a time and place in this document.

1. Definitions

1.1. Language Engineering

Language Engineering covers anything related to languages:

1.2. Parallel Texts

Parallel Texts are linguistic versions of the same text; for example, the English and Spanish versions of the Treaty of Rome are Parallel Texts.

1.3. Alignness

Alignness is a quality of Parallel Texts; for example, the English and Spanish versions of the Treaty of Rome are Parallel Texts and they should be aligned. One can guarantee alignness if the Parallel Texts are kept as Linguistic Objects; i.e., by keeping them in a structured database. The interesting part is aligning Parallel Texts automatically.

1.4. Level of Alignness

Depending on the depth at which the equivalent string can be identified, the texts are aligned at different levels.

In this context, a sentence is a part of a text delimited by a full stop, semicolon or similar; i.e., it has little grammatical meaning and the main interest is to identify Linguistic Objects.
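As a minimal sketch of sentence-level alignment, the Python fragment below splits each text on a full stop, semicolon or similar and pairs the pieces by position. It assumes that both Parallel Texts contain the same number of sentences in the same order, which real texts often do not. The function names and the sample sentences are invented for illustration.

```python
import re

def split_sentences(text):
    """Split after '.', ';', '?' or '!' when followed by whitespace."""
    pieces = re.split(r"(?<=[.;?!])\s+", text.strip())
    return [p for p in pieces if p]

def align_by_position(source, target):
    """Pair the n-th sentence of one text with the n-th sentence of the other."""
    src, tgt = split_sentences(source), split_sentences(target)
    if len(src) != len(tgt):
        raise ValueError("texts do not have the same number of sentences")
    return list(zip(src, tgt))

# Invented sample texts (not quotations from any treaty).
english = "The Council shall act. It shall decide by majority; abstentions are allowed."
spanish = "El Consejo actuará. Decidirá por mayoría; se permiten abstenciones."
pairs = align_by_position(english, spanish)
```

Each pair can then be stored as one Linguistic Object. Texts whose sentence counts differ need a real alignment algorithm, which this sketch does not attempt.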

2. Character set

The present Web character set is probably sufficient for English and for some non-English documents. It is insufficient for complex multilingual environments; for example, the European Institutions have eleven official languages. For this type of environment, a large character set such as Unicode (ISO 10646 BMP) is needed. Note that the same document could contain several languages with different alphabets; for example, a document could mix English and Greek: The Greek Commissioner said: <something in Greek>.

Unicode is a 16-bit character set that includes most of the world's languages. The 16 bits are both the strength and the weakness of Unicode: the strength because one can represent all the characters; the weakness because it is overkill for English. Without compression, it doubles the disk space and the transmission volume. The top 8 bits are not needed for English; i.e., they are set to zero.
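The size argument can be checked with a few lines of Python. This is only an illustration: it uses Python's "utf-16-be" codec as a stand-in for the 16-bit code, which is an assumption that holds for characters in the BMP.

```python
text = "The Greek Commissioner said:"

latin1 = text.encode("iso-8859-1")   # one byte per character
wide = text.encode("utf-16-be")      # two bytes per character, no byte-order mark

size_ratio = len(wide) / len(latin1)

# In big-endian order the top 8 bits of each character come first.
top_bytes = wide[0::2]
english_top_is_zero = all(b == 0 for b in top_bytes)

# Greek letters need the top 8 bits, so they cannot be dropped in general.
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # "Ellinika" in Greek script
greek_top_is_zero = all(b == 0 for b in greek.encode("utf-16-be")[0::2])
```

For the English string the ratio is exactly two and every top byte is zero; for the Greek string the top bytes carry information.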

A Multilingual Web must be able to process Unicode and the present character set. There should be a way to indicate a Unicode file; for example, a file extension or a magic number (Suggestions ?). The programs should also be capable of mapping Unicode to other character sets; for example, to ASCII, ISO 8859-1, PC1, etc. When the target character set is poorer, the mapping should be reasonable; for example, mapping "é" to "e", or an unmappable character to "_" or to its code number. The following cases should be considered:

  1. non-multilingual client ---- non-multilingual server
  2. non-multilingual client ---- multilingual server
  3. multilingual client ---- non-multilingual server
  4. multilingual client ---- multilingual server
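The mapping to a poorer character set described above could be sketched as follows. This is one possible approach, not a prescribed one: it assumes that stripping the accents found by Unicode decomposition is a "reasonable" mapping, and the function name and the fallback character are illustrative.

```python
import unicodedata

def map_to_charset(text, target="ascii", fallback="_"):
    """Map each character to the target set: keep it, strip its accents, or use fallback."""
    out = []
    for ch in text:
        try:
            ch.encode(target)          # representable as-is in the target set
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (e.g. e-acute -> 'e' + combining accent) and drop the accents.
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        try:
            base.encode(target)
            out.append(base if base else fallback)
        except UnicodeEncodeError:
            out.append(fallback)       # nothing reasonable exists in the target set
    return "".join(out)

ascii_version = map_to_charset("caf\u00e9 \u03a9")                   # e-acute, Greek Omega
latin1_version = map_to_charset("caf\u00e9", target="iso-8859-1")
```

With an ASCII target the accented letter loses its accent and the Greek letter becomes "_"; with an ISO 8859-1 target the accented letter is kept unchanged.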

2.1. HTML tagging with Unicode

The tagging in a Unicode document can be done with a single Unicode (16-bit) character per tag; it is one of the few cases where Unicode represents a saving. A region (page or pages) of the coding space should be dedicated to this purpose. It could be considered an HTML Alphabet (Suggestions ?).

2.2. HTTP/1.0

2.2.1. Accept-Charset

"The Accept-Charset header field can be used to indicate a list of preferred character sets other than the default US-ASCII and ISO-8859. This field allows clients capable of understanding more comprehensive or special-purpose character sets to signal that capability to a server which is capable of representing documents in those character sets. ... An example is ... unicode-1-1." (Suggestions ?: Any implementation of this.)
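A server-side reading of this header might look like the sketch below. It is an assumption-laden illustration: it treats the order of the list as the order of preference, ignores any parameters after a semicolon, and falls back to ISO-8859-1 when nothing matches. The function names are hypothetical.

```python
def parse_accept_charset(header_value):
    """Return the character set names of an Accept-Charset value, in listed order."""
    names = []
    for item in header_value.split(","):
        name = item.split(";")[0].strip().lower()   # drop any parameters
        if name:
            names.append(name)
    return names

def choose_charset(header_value, available, default="iso-8859-1"):
    """Pick the first requested character set that the server can deliver."""
    for name in parse_accept_charset(header_value):
        if name in available:
            return name
    return default

server_has = {"iso-8859-1", "unicode-1-1"}
chosen = choose_charset("unicode-1-1, iso-8859-7", server_has)
fallback = choose_charset("iso-8859-7", server_has)
```

A multilingual client talking to a non-multilingual server would simply receive the default, which covers two of the four cases listed in section 2.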

2.2.2. Accept-Encoding

"The Accept-Encoding header field is similar to Accept, but lists the encoding-mechanisms and transfer-encoding values which are acceptable in the response." (Suggestions ?: Unicode specific encoding)

3. Multilingual Aligned Hypertext (MAH)

Multilingual Aligned Hypertext is an extension of the hypertext paradigm to natural languages; for example, a user looking at a document in English should be able to obtain the Spanish equivalent in a transparent way. For this, the Web must know about linguistic versions; i.e., that the same document exists in another language. A lot can be done without changing HTML, just by implementing clients that know about the structure below.

3.1. Some functionalities

A multilingual Web should have functionalities such as: (Suggestions ? : more functionalities)

Some Multilingual Aligned Hypertext is possible with the present Web, using the structure below and the present character set, provided that the end user is aware of the structure and that the present character set is sufficient for the desired languages.

3.2. Data structure

A new data structure is needed for Multilingual Aligned Hypertext. The top of the structure is a mahName.html file. The file can describe several schemes.

The mah files below could have any URL and the name of the document could be different from the file name (DocName). (Suggestions ? : generalize grouping; i.e., a groupingName.html rather than a mahName.html file).

The default for a single set of files is:

mahDocName.html

DocName.mah                          (directory)
           /en.html                   English
           /es.html                   Spanish
           /de.html                   German


The default for several sets of files is:

mahDatabaseName.html

DatabaseName.mah                      (directory)
                /en/DocName1.html      English
                /en/DocName2.html      English

                /es/DocName1.html      Spanish
                /es/DocName2.html      Spanish

                /de/DocName1.html      German
                /de/DocName2.html      German
(Suggestions ? : multilingual documentary indexing)

The mahName.html should be usable directly by the present clients (browsers) and/or indirectly to generate HTML files on the fly. Multilingual clients should use the information to access the documents in a transparent way.
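A multilingual client could derive the location of another linguistic version directly from the two default layouts above. The following sketch assumes relative paths with "/" separators and the directory naming shown in the listings; the function names are hypothetical.

```python
def single_set_path(doc_name, lang):
    """Default layout for a single set: DocName.mah/<lang>.html"""
    return f"{doc_name}.mah/{lang}.html"

def several_sets_path(database_name, doc_name, lang):
    """Default layout for several sets: DatabaseName.mah/<lang>/DocName.html"""
    return f"{database_name}.mah/{lang}/{doc_name}.html"

def switch_language(path, new_lang):
    """Rewrite a path in either default layout so that it points to another language."""
    parts = path.split("/")
    # Locate the '.mah' directory; the language is encoded in the next component.
    i = next(k for k, p in enumerate(parts) if p.endswith(".mah"))
    following = parts[i + 1]
    if following.endswith(".html"):
        parts[i + 1] = new_lang + ".html"    # single set: en.html -> es.html
    else:
        parts[i + 1] = new_lang              # several sets: en/ -> es/
    return "/".join(parts)

spanish_single = switch_language(single_set_path("DocName", "en"), "es")
german_several = switch_language(
    several_sets_path("DatabaseName", "DocName1", "en"), "de"
)
```

Because the language is a fixed component of the path, the client needs no explicit anchor between linguistic versions in these two schemes.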

3.3. Anchoring Strategy

The anchoring strategy must minimize the number of anchors and it must allow changing the defaults. Only one linguistic version of the document should have explicit anchors (e.g. English), the other linguistic versions would have implicit anchors; i.e., the anchors should be calculated by the alignness of the different linguistic versions.

The anchors would have to be at least at sentence level. It would be hard to place implicit anchors in part of a sentence without tagging: the second text should have null anchors; named null anchors if there are several in one sentence.

examples:

(Suggestions ? : Module(s) that should be in charge of implicit anchoring)

3.4. HTTP/1.0

3.4.1. Language tag

"A language tag identifies a natural language spoken, written, or otherwise conveyed by human beings for communication of information to other human beings. Computer languages are explicitly excluded. The HTTP/1.0 protocol uses language tags within the Accept-Language and Content-Language header fields."

3.4.2. Accept-Language

"The Accept-Language header field is similar to Accept, but lists the set of natural languages that are preferred as a response to the request."

The client should request a single language rather than a set of languages, as the request should be for the Spanish linguistic version and not for Spanish or German. Also, there should be a mechanism to request the list of available linguistic versions of a document; the client should have a language drop-down menu (or similar) listing the available linguistic versions, rather than anchors pointing to other linguistic versions.
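The two mechanisms suggested here, a list of available linguistic versions and a request for exactly one of them, could be sketched as below. The sketch assumes the single-set directory layout of section 3.2, where each file is named after its language code; the function names are hypothetical.

```python
def list_versions(file_names):
    """Derive the available language codes from names such as 'en.html'."""
    return sorted(name[:-len(".html")] for name in file_names if name.endswith(".html"))

def select_version(requested, versions):
    """Return the requested language if it exists; never substitute another one."""
    return requested if requested in versions else None

directory = ["en.html", "es.html", "de.html"]
menu = list_versions(directory)           # what a language menu would show
spanish = select_version("es", menu)
missing = select_version("fr", menu)
```

Returning nothing for an unavailable language, instead of silently falling back to another one, reflects the point that the request is for one specific linguistic version.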

3.4.3. Link header

"The Link header provides a means for describing a relationship between the entity and some other resource." (Suggestions ?: A language relation ?).

3.4.4. vary (in URI)

(Suggestions ?: Language ?).

3.5. Applications

3.6. Defaults

3.7. Precedence

3.8. mahName.html

SetName = <the name of the set>
Scheme = [ DirectorySingleSet | DirectorySeveralSets | sgml ]
DataLocation = <a URL>
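Reading these three fields could be sketched as follows. The sketch assumes one "Name = value" pair per line of plain text; how the fields are actually embedded in the HTML file is not specified here. The sample values, including the example.org URL, are invented.

```python
VALID_SCHEMES = {"DirectorySingleSet", "DirectorySeveralSets", "sgml"}

def parse_mah_description(text):
    """Collect 'Name = value' lines into a dictionary and check the Scheme."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue                       # ignore anything that is not a field
        name, value = line.split("=", 1)
        fields[name.strip()] = value.strip()
    if fields.get("Scheme") not in VALID_SCHEMES:
        raise ValueError(f"unknown Scheme: {fields.get('Scheme')!r}")
    return fields

# Invented sample description.
sample = """SetName = Treaty
Scheme = DirectorySingleSet
DataLocation = http://example.org/Treaty.mah/"""

description = parse_mah_description(sample)
```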

3.9. Preference

The preference file of the client should include:
LanguagePreference= (ISO 639 defines two-character codes for languages)

4. Dragoman

Dragoman is a reference model for Language Engineering. It uses the Multilingual Aligned Hypertext technique. In essence, Dragoman describes a Database (part structured and part documentary) and Services that can be implemented over a (multilingual) Database. The Web paradigm is particularly well adapted to Dragoman. The term Dragoman has nothing to do with dragons; it means language interpreter.

What follows is a very brief description of some of the Services that could be implemented over the Database. There could be several programs offering the same Service. Services processing whole documents could be implemented in batch; particularly if they are using a very large Database (several gigabytes).

4.1. Interactive Search

Selects the Multilingual Aligned Texts that match a search criterion. The search is fuzzy (e.g. 87% match). Unfound requests are valuable information that must be processed further. The system must keep track of the unfound requests in order to put people with similar needs in contact (matchmaker); the user must decide what is a typing error and what is a genuine unfound request. The user can also send messages to terminologists (demand driven terminology).
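A fuzzy search with a match threshold and a log of unfound requests could be sketched as below. The similarity measure is an assumption: the sketch uses the ratio of Python's difflib.SequenceMatcher, which is only one of many possible measures, and the record layout and sample terms are invented.

```python
from difflib import SequenceMatcher

def fuzzy_search(query, records, threshold=0.87, lang="en"):
    """Return (score, record) pairs whose text in `lang` matches the query closely enough."""
    hits = []
    for record in records:
        score = SequenceMatcher(None, query.lower(), record[lang].lower()).ratio()
        if score >= threshold:
            hits.append((score, record))
    hits.sort(key=lambda hit: hit[0], reverse=True)   # best match first
    return hits

def search_and_log(query, records, unfound_log, threshold=0.87):
    """Search, and remember the request when nothing is found."""
    hits = fuzzy_search(query, records, threshold)
    if not hits:
        unfound_log.append(query)
    return hits

# Invented sample records.
records = [
    {"en": "common agricultural policy", "es": "política agrícola común"},
    {"en": "internal market", "es": "mercado interior"},
]
unfound = []
typo_hits = search_and_log("common agricultural polcy", records, unfound)
no_hits = search_and_log("fisheries quota", records, unfound)
```

The misspelled query still finds its record, while the second query ends up in the unfound log, where it could later be matched against similar requests or forwarded to a terminologist.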

4.2. The Translation Folder (full preprocessing)

The objective is to obtain a complete Translation Folder for a given document. Hence, the translator should not need to consult dictionaries, databases, glossaries, nomenclature lists, etc. It is like having a hundred assistants preparing the text for the translator. In a typical Translation Folder, some paragraphs should be fully translated and some paragraphs should be a mixture of full sentences, segments, titles, terms, nomenclatures, etc. (all these items are packaged as Linguistic Objects); background documents could also be taken into account. The Linguistic Objects are marked with a Status; for example, unverified, verified, compulsory, etc. The search follows a fuzzy biggest-chunk heuristic. Traditionally there are two texts, source and target, but there could be any number of language fields. This could be the most useful Service for the translator and it should be implemented early. The translator could use the result on paper or on the screen.

4.3. Preprocessing for Machine Translation

Similar to the Translation Folder. It should be adapted to an (existing) machine translation program that follows up the processing. For example, select only exact matches (no fuzzy) and terms in the unfound phrases; the machine translation program would translate only the unfound phrases.

4.4. Machine Translation

A Machine Translation program that uses the Database directly. For example, a program could combine perfect matches, process the easy fuzzy matches such as dates, pure Machine Translation, etc.

4.5. Pseudo-Automatic Translation (PAT)

Similar to the Translation Folder, but where all the texts are found with a 100% match (no fuzzy search). The program should be restricted to a collection of records; i.e., it should not be allowed to roam the Database, as there could be unpleasant surprises. In particular, one must avoid word-by-word translation; hence one must be very careful with small Multilingual Aligned Texts (for example, a one-word Multilingual Aligned Text).

4.6. Document Generation

All the linguistic versions of a document are generated camera ready. There is no source and translation as such, the index is created, the typesetting (nearly) done. This is the most useful Service for the Organization. It is a very efficient way to produce documents. The three phases Author-Translator-Printer are highly integrated. It is particularly adapted to periodic publications. The production of standardized documents is trivial.

Documents in several linguistic versions are often required to be synchronized; i.e., each page in each linguistic version must contain the same content and the same lay-out (text, number of paragraphs, etc). The typesetting, including the synchronization, must be automated and each page should not be processed by a human; a human operator should intervene only to fine-tune the publication. TeX should be considered.

A document might need several representations; for example, typeset for the Official Journal and formatted for a CD-ROM. First, a document in SGML should be generated; indeed, the SGML document is the document. All the subsequent representations should be created from the SGML document. This method should guarantee that all the representations have the same content.

With such a system in place, the creation of secondary products is easy. For example, a Parliamentary Commission could work with a draft of the Budget typeset like the Official Journal, in all the linguistic versions, enriched with hidden comments.

4.7. Document Comparison

The user directs the program to a document similar to the one that has to be translated. The new pieces could be fetched from the Database. This program could work without the Database, though the new pieces would then not be fetched. Similar translations could arise as a version of a previous document and as a new similar document.
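Finding the new pieces of a document relative to a similar, already translated one could be sketched as follows. The sketch assumes both documents are already split into sentences and compares them with Python's difflib; the function name and the sample sentences are invented.

```python
from difflib import SequenceMatcher

def new_pieces(previous_sentences, new_sentences):
    """Return the sentences of the new document that do not appear in the previous one."""
    matcher = SequenceMatcher(None, previous_sentences, new_sentences, autojunk=False)
    fresh = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):      # changed or added in the new document
            fresh.extend(new_sentences[j1:j2])
    return fresh

# Invented sample documents.
previous = ["Article 1.", "The budget is adopted.", "Article 2."]
current = ["Article 1.", "The budget is amended.", "Article 2.", "Article 3."]
to_translate = new_pieces(previous, current)
```

Only the sentences returned need to be looked up in the Database or translated; the unchanged ones can reuse the existing translation.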

4.8. Author's Workbench

Authors could use a similar technique to Translation Folder and Document Comparison. The unknown parts of the text would be marked and in certain cases alternatives would be proposed. Texts created with the translation phase in mind are easier to translate. Ideally, the author should aim to produce a text for translation with Pseudo-Automatic Translation.

4.9. Terminology Verification

The objective is to verify the Consistency and Harmonization of the terminology. The concepts are closely related and they can be combined, but they are not the same.

Consistency is naming the same object with the same term. It is an internal characteristic of a set of documents (the unitary set is allowed) and it does not need a Database. The more linguistic versions of the set of documents the better.

Harmonization is imposing a term by the Terminological Authority. It is an external characteristic of the document and it needs a Database with the harmonized terms.
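The difference between the two checks can be illustrated with a small sketch. It assumes that the terms have already been extracted as (source term, target term) pairs from aligned documents, which is the hard part and is not shown; the function names and sample terms are invented.

```python
from collections import defaultdict

def inconsistent_terms(term_pairs):
    """Consistency: report source terms rendered by more than one target term."""
    seen = defaultdict(set)
    for source, target in term_pairs:
        seen[source].add(target)
    return {s: sorted(t) for s, t in seen.items() if len(t) > 1}

def unharmonized_terms(term_pairs, harmonized):
    """Harmonization: report pairs that contradict the terms imposed by the Authority."""
    return [
        (s, t) for s, t in term_pairs
        if s in harmonized and harmonized[s] != t
    ]

# Invented sample data.
pairs = [
    ("internal market", "mercado interior"),
    ("internal market", "mercado interno"),
    ("budget", "presupuesto"),
]
authority = {"internal market": "mercado interior"}

inconsistent = inconsistent_terms(pairs)
violations = unharmonized_terms(pairs, authority)
```

The first check needs only the documents themselves; the second needs the external list of harmonized terms, here the `authority` dictionary.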

4.10. Multilingual Aligned Text Editor

An editor that shows at least two (aligned) texts, moves the texts in sync, highlights the differences, etc.

4.11. Printing

A program that prints one or several Multilingual Aligned Texts side by side. It could be the step following the Translation Folder. Multilingual Aligned Texts (source and target) on paper allow the translator to use traditional tools such as dictation.

5. Miscellaneous

5.1. Place for the extensions

The term Web is used to name the clients, servers, HTTP, HTML, etc. This vagueness is deliberate, as I do not wish to suggest too specifically where to implement the mechanisms. I give a concrete syntax, but it should mostly be considered as an example to illustrate the functional characteristics. Probably, a multilingual client should be developed.

5.2. Suggestions

The concepts need maturing and suggestions are strongly encouraged, particularly where requested with the label: Suggestions ?

5.3. To do list

5.4. Contact

Mr. Manuel Tomas CARRASCO BENITEZ
Commission of the European Communities
Jean Monnet Building
L-2920 Luxembourg

Telephone : +352 4301 32298
m.carrasco-benitez@mhsg.cec.be
http://www.echo.lu/other/norm.html

From June 1st 1995 (the start of a sabbatical year)
Phone : +352 467303
Fax : +352 467302

5.6. Disclaimer and Copyright

This document represents only the views of the author. The document does not engage in any way the Commission of the European Communities.

Copyright 1994, 1995, the author and the Commission of the European Communities, for the section on Dragoman.

Copyright 1995 the Conference Organizer for the Third International World-Wide Web Conference.