INTERNET-DRAFT                                          M.T. Carrasco Benitez
<draft-benitez-winter-cultures-00.txt>                          May 16th, 1996
Expires: November 16th, 1996

                                    WInter
                   (Web Internationalization & Multilinguism)
Status of this Memo
This document is an Internet-Draft.
Internet-Drafts are working documents of the
Internet Engineering Task Force (IETF),
its areas, and its working groups.
Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any time.
It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress".
To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt
"
listing contained in the Internet-Drafts Shadow Directories on
ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).
Distribution of this document is unlimited.
Please send comments to the WInter mailing list at
<winter@dorado.crpht.lu>.
Information about the WInter mailing list, including subscription details, is in
the WInter Page at:
http://www.crpht.lu/~carrasco/winter
This document discusses the Internationalization & Multilinguism of the Web:
a Web capable of supporting different cultures, natural languages and
Language Engineering facilities such as Parallel Texts.
Internationalization permeates most subsystems:
client, transmission, server, data and authoring;
the primitive mechanism for WIntering
should be part of the Web foundations.
1. Introduction
1.1 Mandate
1.2 Writing style
1.3 Terminology
2. Character Set
2.1 Back office
2.2 Front office
2.3 Multilingual typography
2.4 The characters in the URL
3. Internationalization & localization
3.1 Elements of localization
3.2 Messages as HTML pages
4. Multilinguism
5. Parallel Hypertext
5.1 Definition
5.2 Language tags
5.3 Document request
5.4 Parallel Hypertext Data Structure (PHDS)
5.5 Linking strategy
5.6 Generation of parallel texts
5.6.1 Language dependent strings
5.6.2 Language-void document
6. Bidirectionality (BIDI)
7. The LANG attribute
8. LINKs
9. Multilingual thesaurus
10. Electronic Data Interchange (EDI)
11. Passing selected text to a CGI
12. Reference model for Internationalization & Multilinguism
13. VRML
14. Java
15. Dragoman
15.1 Interactive Search
15.2 The Translation Folder (full preprocessing)
15.3 Preprocessing for Machine Translation
15.4 Machine Translation
15.5 Pseudo-Automatic Translation (PAT)
15.6 Document Generation
15.7 Document Comparison
15.8 Author's Workbench
15.9 Terminology Verification
15.10 Multilingual Aligned Text Editor
15.11 Printing
16. Acknowledgments
17. Bibliography
18. Author Address
The intention of this document is to consider all aspects of WIntering.
It aims to fulfill two functions:
- A catalogue of issues
- A primer
To a very large extent, it puts together the efforts of other groups.
It goes into more detail where material is not covered elsewhere.
An Internationalized & Multilingual Web
should have the traditional facilities of Internationalization
and more advanced facilities needed for Language Engineering.
For example, clients should have a language menu
(similar to edit or file menus)
that shows in which other linguistic versions
the currently displayed document is available;
or clients should be capable of displaying two linguistic versions of the
same document side by side and moving them in sync.
Another noteworthy characteristic of this manual is that
it doesn't always tell the truth.
When certain concepts of TeX are introduced informally,
general rules will be stated; afterwards you will find that
the rules aren't strictly true.
The TeXbook, Donald E. Knuth
The above quote particularly applies to the documents summarized in this document.
Though the intention is to make this document self-contained
by summarizing or quoting other documents,
it is strongly recommended to consult the source documents.
One of the recommendations of the
Internationalization Workshop during the
Fifth International WWW Conference
in Paris on May 6th 1996,
was that a document should be maintained to
fulfill the purpose described in the above introduction.
The author accepted the task and
the present document is the result.
A special effort should be made to make this document as
accessible as possible to non-computer specialists
(e.g., linguists) and non-English native speakers.
Due to the characteristics of WInter,
there should be a significant number of both.
This does not imply that there should be one type of document
for each type of participant.
It means that this document should be accessible to all participants.
Perhaps by adopting a journalistic style and restating the obvious.
The overhead should be small and it is good to avoid misunderstanding,
even between people of the same field.
Comments regarding the writing style from journalists
or readers with similar profiles are very welcome;
i.e., non-computer specialists who
have to explain computer materials to other non-computer specialists.
Some of the suggestions could be what additional material should
be included to make this document more self-contained,
and what terms should be replaced to make it more accessible.
But, the gory normative details must be present.
- Alignedness:
It is a quality of Parallel Texts;
for example, the Treaty of Rome in English and Spanish are Parallel Texts and
they should be aligned.
The interesting part is aligning Parallel Texts automatically.
- Author-Translator-Publisher Chain (ATP-chain):
It refers to the integration of all the phases in the production of documents,
usually in large distributed systems.
- Globalization:
In the context of electronic commerce,
the mechanisms to facilitate global trade.
Internationalization & Multilinguism are some of these mechanisms.
A legal framework is an example of a non-computer mechanism.
- I18N:
Abbreviation for Internationalization;
the 18 refers to the 18 characters "nternationalizatio"
between the initial "I" and the final "N".
- Language Engineering:
Language Engineering is the application of computer science
to natural languages.
For example:
- Terminology
- Translator's Memory
- Multilingual documentary databases
- Aligned Text
- Translator's Workbench
- Author's Workbench
- Machine Translation
- Publishing (in particular, multilingual synchronized publishing)
- Level of Alignedness:
This is a metric of alignedness.
Depending on the depth to which it is possible to identify the Linguistic Objects,
the texts are aligned at:
- Document level: the trivial case; i.e., Parallel Texts.
- Paragraph level: not too hard to achieve.
- Sentence level: desirable and possible to achieve.
- Term level: it needs tagging for automatic alignedness.
- Word level: it needs tagging for automatic alignedness.
In this context, sentence
is a part of a text delimited by a dot, semicolon or similar;
i.e., it has little grammatical meaning and
the main interest is to identify Linguistic Objects.
- Linguistic Object:
Linguistic Object is a unit of language representation.
It can be a fixed language representation
(term, abbreviation, title, segment, phrase, paragraph, etc)
or meta-language representation (a grammatical construction, etc).
More generally, a Linguistic Object is a discrete linguistic unit
(usually a string) whose meaning is created by the program treating it.
- Multilingual Aligned Text (MAT):
A MAT is a record in a table with one Linguistic Object per language field
(English, Spanish, German, etc);
the Linguistic Objects are equivalents (usually translations) of each other.
There are other fields for classification and other purposes.
MATs constitute independent elements of a table;
i.e., there is no ordering in the table.
The end result is a data structure similar to a multilingual dictionary
(a sketch of such a record is given at the end of this terminology list).
- Parallel Texts:
Texts that are translations of each other.
For example, the Treaty of Rome in English and Spanish are Parallel Texts.
Parallel Texts could be aligned to several levels.
- WInter:
It stands for Web Internationalization & Multilinguism.
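The following Python fragment is the sketch of a MAT record referred to under
the Multilingual Aligned Text entry above; the field names and values are
illustrative assumptions, not part of any standard.

# A minimal sketch of a MAT record; field names are illustrative assumptions.
mat_record = {
    "id": 4711,                      # record identifier
    "classification": "furniture",   # non-linguistic field for classification
    "status": "verified",            # e.g. unverified, verified, compulsory
    "en": "The black table",         # one Linguistic Object per language field
    "es": "La mesa negra",
    "de": "Der schwarze Tisch",
}
# Records are independent; a MAT table is simply an unordered collection.
mat_table = [mat_record]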
A large character set is a basic prerequisite for having
Internationalization & Multilinguism.
The bottom line is that the Web must be capable of handling
Unicode [UNICODE].
The character set should be considered a low-level layer;
i.e., like the wires in the seven-layer ISO Reference Model
(physical, data link, network, etc).
Other functionalities should be in other layers.
There is a tendency to overload this layer,
as opposed to defining new layers.
There are two aspects to the character set:
- The Back office:
It deals with storage on disk, transmission, representation in the document, etc.
- The Front office:
It is concerned with rendering on the screen or printer.
Latin-1 [ISO-8859-1] is the default character set for the Web.
Latin-1 is only sufficient for Western European languages.
Latin-1 is an 8-bit encoding.
This permits a maximum of 256 characters.
Unicode (ISO 10646 BMP)
is a large character set that includes most of the world languages.
Unicode is a 16-bit encoding.
This permits over 65,000 characters.
At present, over 25,000 positions are still free.
This form is also called UCS-2; i.e., Universal Character Set 2-bytes.
Unicode is the first plane of ISO 10646 (see below);
this plane is also called
BMP (Basic Multilingual Plane) or
Plane Zero.
The Internationalization of the Hypertext Markup Language
[I-HTML]
proposes Unicode as the document character set.
ISO 10646 is a 32-bit encoding.
It is divided into 32,000 planes, each with a capacity of 65,000 characters.
This permits 2,080 million characters.
This form is also called UCS-4; i.e., Universal Character Set 4-bytes.
Only the first plane (Unicode) is in use.
UTF-8 (Universal Character Set Transformation Format)
is an addendum to ISO 10646.
It provides compatibility with ASCII:
the ASCII characters are represented by 1 byte (8 bits)
and not 4 bytes (32 bits).
In general, it is economical with the bytes used in the encoding.
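As a rough illustration of this economy, the following Python fragment (a sketch,
not part of any specification) prints the number of bytes UTF-8 uses for an ASCII
character, an accented Latin-1 character and an ideograph.

# Sketch: byte counts produced by UTF-8, compared with fixed-width encodings.
for ch in ("A", "\u00e9", "\u6f22"):          # "A", "é", and an ideograph
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s) in UTF-8,",
          "2 bytes in UCS-2, 4 bytes in UCS-4")
# "A" -> 1 byte (identical to ASCII); "é" -> 2 bytes; the ideograph -> 3 bytes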
[HTTP-1.1] allows for the character set to be negotiated.
For example, the client and server can agree on using Unicode.
Rendering is drawing the glyphs
(graphic representation of the characters)
on the screen or printer.
This is the job of the browser and
the browser depends on the graphical facilities of the computer.
Undisplayable characters
are the characters that cannot be displayed due to the lack of facilities.
The I-HTML
"does not prescribe any specific behavior",
but notes some "considerations".
WInter recommends the following:
- The behavior of undisplayable characters must be controlled by
the option settings of the browser.
- Some options can be combined.
- There must be a small Undisplayable Characters Flag
in the browser part of the screen, not in the document part.
Something similar to the red button indicating that the browser
is loading a document, but smaller.
The flag must be ON if the current document contains one or more
undisplayable characters.
The presence or absence of the flag must be user definable.
- Undisplayable Character Tolerance is a user definable value
in the range from 0 to 10,
that signals the behavior of the browser.
- 0 Undisplayable Character Tolerance means ignore all
undisplayable characters.
- 5 Undisplayable Character Tolerance means a reasonable default warning
for undisplayable characters.
This behavior must be defined.
For example, show only up to 10 continuous undisplayable characters and
try remaps, such as "é" to "e".
- 10 Undisplayable Character Tolerance means show one
Replacement Glyph
for each undisplayable character.
- The other intermediate values must change gradually.
- Undefined Undisplayable Character Tolerance must gravitate towards
the default value (5).
- The undisplayable characters must be remappable to a user definable
Replacement Glyph;
for example, "_".
Or one of several numeric representations;
for example, hexadecimal or decimal.
- The default Replacement Glyph must occupy approximately
the same space as the average glyph in the document.
It must be a box containing the Unicode value in hex.
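A minimal sketch, in Python, of how a browser might apply these tolerance rules;
the function, the ASCII-only glyph set and the bracketed hexadecimal box are
illustrative assumptions rather than prescribed behavior.

# Sketch: a browser-side filter applying the Undisplayable Character Tolerance.
def replace_undisplayable(text, displayable, tolerance=5, glyph="_"):
    out = []
    for ch in text:
        if ch in displayable:
            out.append(ch)                    # the glyph is available
        elif tolerance == 0:
            continue                          # 0: ignore undisplayable characters
        elif tolerance >= 10:
            out.append("[%04X]" % ord(ch))    # 10: box with the Unicode value in hex
        else:
            out.append(glyph)                 # intermediate: Replacement Glyph
    return "".join(out)

ascii_glyphs = set(map(chr, range(32, 127)))
print(replace_undisplayable("Caf\u00e9 \u6f22", ascii_glyphs, tolerance=10))
# Caf[00E9] [6F22]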
Font Servers could supply the browser with missing glyphs.
{The proposition of Martin Dürst will be summarized here.}
The characters allowed in the URL are a subset of ASCII.
URLs were supposed to be hidden,
but they are very visible and important commercially:
firms want to spell their names with accents.
The most urgent need is a large character set for the query part.
There have been proposals to use UTF-8.
URLs need a lot of work.
Internationalized software is developed without
the cultural characteristics embedded.
It can be localized parametrically for different cultures;
for example, the same software can run for
Germany with the German conventions,
or for Italy with the Italian conventions.
Internationalization is a well known field;
for example, a significant amount of work was done
during the POSIX (Unix) standardization.
The mechanisms must be sufficient for implementing the localizations.
Localization itself is usually discussed in other fora;
for example, how to represent the date in Germany.
Most conventions have been already agreed.
Any number of cultures (real or imaginary) are possible.
For example, France, Germany, European Commission.
In the case of the European Commission,
it has to work in the eleven official languages (including Greek),
and with cross-cultural conventions or with the national conventions.
- Languages:
Two aspects:
- Language strings in the software.
- Data in the document.
Example, the software could be in German and
the document shown in French.
- Sorting order
- Number representation:
Example, the internal number could be 12345.67 and
the external representation could be 12,345.67 or 12.345,67.
- Date & Time:
Example, the internal representation could be 19951231 and
the external representation could be December 31st, 1995, or 31-12-1995.
- Short quotations:
Example,
- "I am a Berliner" (English)
- <<Je suis un Berlinois>> (French)
- ,,Ich bin ein Berliner'' (German)
The new element <Q> in I-HTML is for this purpose.
New internationalization elements should be added to this list,
for example, color.
The software should be localized from a list of preferred localizations, and
switchable from one localization to another without restarting the application.
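A minimal sketch, in Python, of such parametric localization for the number and
date conventions listed above; the table of conventions and the function names
are illustrative assumptions.

# Sketch: the same internal values rendered under different cultural conventions.
import datetime

CONVENTIONS = {
    "de":    {"thousands": ".", "decimal": ",", "date": "%d.%m.%Y"},
    "en-US": {"thousands": ",", "decimal": ".", "date": "%B %d, %Y"},
}

def localize_number(value, culture):
    c = CONVENTIONS[culture]
    whole, frac = ("%.2f" % value).split(".")
    groups = []
    while whole:
        groups.insert(0, whole[-3:])
        whole = whole[:-3]
    return c["thousands"].join(groups) + c["decimal"] + frac

def localize_date(yyyymmdd, culture):
    d = datetime.datetime.strptime(yyyymmdd, "%Y%m%d")
    return d.strftime(CONVENTIONS[culture]["date"])

print(localize_number(12345.67, "de"))       # 12.345,67
print(localize_date("19951231", "en-US"))    # December 31, 1995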
The Status-Code and the Reason-Phrase (see 6.1.1, HTTP-1.1)
are presented as HTML pages.
These are Language strings in the software but
are usually presented as data documents.
For example, 404: Not Found.
The localization of the Reason-Phrase can be done by the client or the server.
If the client can do a better job,
it has to drop the page sent by the server and
generate the localized page from the Status-Code and the LANG tag.
Multilinguism deals with advanced language facilities,
often several languages simultaneously.
It is also referred to as Language Engineering.
This comes from the tradition of specialized software for
Language Engineering, such as Translator's Workbench.
One of the main applications is the processing of
Parallel Texts.
Most of the software in Language Engineering is incompatible and
there are practically no standards in this field.
Usually, researchers or vendors start from scratch and
develop all the modules,
even horizontal modules such as user interfaces and data structures,
rather than concentrate on the engines for language processing
(for aiding the translator, machine translation, etc).
One of the main immediate objectives in Language Engineering
must be the creation of standards that clearly separate data and software;
i.e., it should be possible to acquire a translation aid program from one vendor and
the dictionaries from another vendor.
The purpose is not to make every browser a Translator's Workbench,
though browsers could do with more advanced language facilities than
are usually found in internationalized products.
But the standards must allow the construction of Translator's
Workbenches based on the Web technology.
After security and secure payment over the Internet,
Language Engineering is one of the most economically relevant applications;
in intranets, with fewer security requirements,
it is probably the most important.
It is as horizontal as publishing and,
indeed, it is the second phase in the ATP-chain (Author-Translator-Publisher).
Translating is expensive and very human intensive.
For most texts, machine translation is not acceptable.
On the other hand,
translation aiding tools are very cost effective,
particularly if integrated in an ATP-chain.
Savings in translation tend to be large.
Parallel Hypertext is an extension of the hypertext paradigm
to natural languages.
For example,
a user looking at a document in English should be able
to obtain the Spanish version in a transparent way;
i.e., just by selecting the Spanish option in a language menu
and not by selecting a link embedded in the English version.
For this, the Web must know about languages;
i.e., the notion of "the same document in another language".
The same property of alignedness in Parallel Texts applies to Parallel Hypertext.
The language tags (see 3.10, HTTP-1.1) are composed of a primary language tag
and one or more subtags that could be empty.
Examples:
en
en-US
en-cockney
There must be a way to indicate
- Human translation
- Machine translation
- Transliteration
This could be part of a subtag or inside the document.
{Examples will be added.}
Clients should be able to request documents at least in the following ways:
- A document is requested according to a language preference list that
could be the same list used for choosing the display labels in the user interface.
The server must respond with the best linguistic version and
the list of available linguistic versions.
The best linguistic version means the one nearest to the top of the list;
if none is available,
the one nearest to the top of the defaults in the server.
In this case,
the browser probably does not know which linguistic versions are available.
{This will be developed.}
- A document is requested in one specific language.
The server must respond only with that linguistic version
(no other is acceptable) and
the list of available linguistic versions.
In this case,
the client probably knows that the requested version is available;
it could be the result of a previous conversation with the server.
Example:
- Conversation 1
Client : Give me MyDoc with this order of preference: Danish, English or German
Server : Take MyDoc in German; it is available in German, Italian and Spanish
- Conversation 2
Client : Give me MyDoc only in Spanish
Server : Take MyDoc in Spanish;
it is available in German, Italian and Spanish
The linguistic versions of the document could be in different servers.
This could be done with the Accept-Language and Content-Language
facilities (see 10.4 and 10.11, HTTP-1.1).
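A minimal sketch, in Python, of the selection rule described above, with the two
example conversations repeated as comments; the function name is an illustrative
assumption, and the use of Content-Language to announce all available versions is
the extension proposed in this document, not standard HTTP/1.1 behavior.

# Sketch: server-side choice of the best linguistic version.
def best_version(preferences, available, server_defaults):
    for lang in preferences:        # the one nearest to the top of the user's list
        if lang in available:
            return lang
    for lang in server_defaults:    # otherwise, nearest to the top of the defaults
        if lang in available:
            return lang
    return None

available = ["de", "it", "es"]
# Conversation 1:  Accept-Language: da, en, de   ->  Content-Language: de
print(best_version(["da", "en", "de"], available, ["en"]))    # de
# Conversation 2:  Accept-Language: es           ->  Content-Language: es
print(best_version(["es"], available, ["en"]))                # es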
The parameter in Accept-Language,
the quality factor "q",
is described as
"... estimate of the user's comprehension of that language ...".
But the user indicates his language preference list and
there is no need to use the parameter with this meaning.
It would be more useful to indicate the
"minimum acceptable quality of the translation".
Some of the translations could be done by more or less experienced
translators, or by machine translation.
A different usage could be to indicate the level of alignedness.
Maximum acceptable size "mxb" is not used.
It could indicate the number of linguistic versions desired.
An Accept-Language with a single language parameter must mean that
the browser only wants that linguistic version and not another.
The Content-Language
"... describes the natural language(s) of the intended audience ...".
The meaning of this field should be
"the list of linguistic versions available";
it should be used by the browser to update the language menu,
so the user could know which other linguistic versions are available.
One Parallel Hypertext Data Structure
contains all the information for one Parallel Hypertext Document.
The Parallel Hypertext Data Structure must allow the following:
- Several data schemes. For example, directory, SGML, tar, etc
- Keeping the linguistic versions in different servers
- Conversation with monolingual clients.
In this case, the user must know the structure
The Parallel Hypertext Data Structure has two parts:
- The PHDS-Header:
Contains administrative data;
for example, where the German linguistic version is located.
The data is divided into structured fields.
- The PHDS-Body:
Contains the linguistic data.
It has one section per language.
The PHDS-Header is always an HTML file.
This file must fulfill two functions:
- Allowing a user to select one linguistic version
- Be used by WIntered Web programs (clients/servers) as a data structure
to locate the pertinent linguistic version
The PHDS-Header must contain at least the following information:
- Name
- DataScheme
- DataLocation (for all the parts)
The DataScheme applies only to the PHDS-Body;
the PHDS-Header is always in HTML.
{An example of a file in HTML will be added.}
The default for a single set of files is:
DocName.html (PHDS-Header)
DocNameDir (PHDS-Body, a directory)
/en.html English (PHDS-Body language section)
/es.html Spanish (PHDS-Body language section)
/de.html German (PHDS-Body language section)
The default for several sets of files is:
DocName.html (PHDS-Header)
DocNameDir (PHDS-Body, a directory)
/en/DocName1.html English (PHDS-Body language section)
/en/DocName2.html English (PHDS-Body language section)
/es/DocName1.html Spanish (PHDS-Body language section)
/es/DocName2.html Spanish (PHDS-Body language section)
/de/DocName1.html German (PHDS-Body language section)
/de/DocName2.html German (PHDS-Body language section)
The DocName.html should be usable directly by present clients
(browsers) and/or indirectly to generate HTML files on the fly.
Multilingual clients should use the information to access the
documents in a transparent way.
Requesting a URL of a PHDS-Header must get the linguistic version
according to the rules of the language preferences.
Requesting a URL of a PHDS-Body language section must get that linguistic
version.
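A minimal sketch, in Python, of how a WIntered server might map a PHDS-Header
name and a chosen language onto the PHDS-Body language section, following the
default single-set layout above; the function name is an illustrative assumption.

# Sketch: mapping a PHDS-Header name plus a chosen language to the
# PHDS-Body language section, using the default single-set layout.
import os

def phds_body_path(header_path, language):
    # .../DocName.html  ->  .../DocNameDir/<language>.html
    base, _ = os.path.splitext(header_path)
    return "%sDir/%s.html" % (base, language)

print(phds_body_path("/docs/MyDoc.html", "es"))   # /docs/MyDocDir/es.html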
The server must know at least the following defaults:
- language with the explicit links
- preferred language list
- MAT table
{This will be extended.}
A standard data structure for Parallel Hypertext
would be of use for anybody working with Parallel Texts,
independently of whether the Web is used or not.
For example, CD-ROMs could be published with Parallel Texts
for language processing programs,
such as Machine Translation,
that would know what to expect.
At present, there is no standard for Parallel Texts or MAT.
The relation with Text Encoding Initiative (TEI) will be explored.
The linking strategy must minimize the maintenance.
This is essential for large multilingual documentary databases.
For example, the millions of pages of the European Institutions
in eleven languages.
Only one linguistic version should have explicit links;
i.e., the links as used today that are physically present in the documents.
The other linguistic versions would have implicit links;
i.e. links that would not be physically present in the texts,
but they could be calculated by the alignedness
of the different linguistic versions.
The generation of implicit links could be a client,
server and/or authoring affair:
- Client.-
A client could receive a linguistic version with explicit links
and a linguistic version with implicit links.
The client would display the linguistic version with the explicit links
or it would calculate the implicit links on the fly
and display the result.
- Server.-
A multilingual server could process documents with implicit links and
generate documents with explicit links on the fly.
- Authoring.-
An interactive or batch authoring system could process documents with
implicit links and
it could create new documents with explicit links;
the server would not know how the new documents were created.
These options should be considered as a continuum and
(some) are not mutually exclusive:
most degrees between the extremes are possible.
For example, servers could be able to create documents on the fly
and they could be using documents with the links generated
by authoring systems.
Indeed, a mixture could be the most probable case.
The level of alignedness should be calculated in advance and kept
in the Parallel Hypertext Data Structure.
Some documents are widely regarded as aligned because
they were revised over half a dozen times and
have been heavily used for decades (best-case documents);
yet once submitted to a computer program,
it came to light that they were not aligned even at paragraph level.
The linked text
(i.e., what goes between <a ...> and </a>)
would have to be at least at the level to which the texts are aligned.
For example, for texts aligned only at paragraph level,
it is not possible to calculate implicit links at sentence level.
A corollary is that texts aligned at document level can have implicit
links only at the beginning or at the end.
The links would have to be at least at sentence level.
It would be hard to place implicit links in part of a sentence without
tagging:
the second text should have null links
(named null links if there
are several in one sentence).
Examples:
- No need for null links in the second text.
A whole sentence is linked in the first text and finding the place
for the implicit links in the second text is easy.
The white table. <a href="MyURL"> The black table </a> The green table.
La mesa blanca. La mesa negra. La mesa verde.
(implicit link)
- It needs a null link in the second text.
Only part of a sentence is linked in the first text and
finding the place for the implicit link in the second text is hard;
i.e., it cannot be done with simple string processing and
it needs computational linguistics.
The white table. The black <a href="MyURL"> table </a> The green table.
La mesa blanca. La <a name="Null"> mesa </a> negra. La mesa verde.
(null link)
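A minimal sketch, in Python, of placing implicit links for texts aligned at
sentence level; the naive sentence splitter and the handling of the HTML are
illustrative assumptions, and the null-link case above is deliberately not covered.

# Sketch: copy an explicit link from sentence i of the first text onto
# sentence i of the second text (texts aligned at sentence level).
import re

def split_sentences(text):
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

def implicit_links(source, target):
    out = []
    for src, tgt in zip(split_sentences(source), split_sentences(target)):
        match = re.search(r'href="([^"]+)"', src)
        if match:                       # the sentence carries an explicit link
            out.append('<a href="%s">%s</a>' % (match.group(1), tgt))
        else:
            out.append(tgt)
    return " ".join(out)

en = 'The white table. <a href="MyURL">The black table</a>. The green table.'
es = "La mesa blanca. La mesa negra. La mesa verde."
print(implicit_links(en, es))
# La mesa blanca. <a href="MyURL">La mesa negra.</a> La mesa verde.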
The linguistic versions could be generated through machine translation
or other techniques.
For example, a system could have documents in Spanish and
a program for translation to English.
The user should be informed, via the language menu, in which
languages and with which techniques (MT, human translator, etc)
the documents are available.
{This will be extended.}
These are tags to be replaced by a language string (Linguistic Object)
according to the language requested.
For example,
the following shows the content of an HTML document and
the resulting replacement; assuming that the language requested is German and
that the Linguistic Object corresponding to the identifier String_1 is
the German phrase below:
<SomeTag SomeLabel=String_1>
Ich bin ein Berliner
A document without any language string;
i.e., it contains only language dependent strings.
In this case, only one HTML document is needed and not one per language;
this HTML document could be considered a mask.
A database with Linguistic Objects is needed.
The same Linguistic Object can be used in several documents.
This technique could be used for the localization of the messages sent by
the server as HTML documents.
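A minimal sketch, in Python, of filling a language-void mask from a table of
Linguistic Objects; the tag syntax follows the example above, while the table
contents and function names are illustrative assumptions.

# Sketch: replace language dependent string identifiers in a language-void
# mask with the Linguistic Objects of the requested language.
import re

LINGUISTIC_OBJECTS = {
    "String_1": {"de": "Ich bin ein Berliner", "en": "I am a Berliner"},
}

def fill_mask(mask, language):
    def lookup(match):
        return LINGUISTIC_OBJECTS[match.group(1)][language]
    return re.sub(r"<SomeTag SomeLabel=(\w+)>", lookup, mask)

print(fill_mask("<SomeTag SomeLabel=String_1>", "de"))   # Ich bin ein Berliner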
(see 4.2, I-HTML)
{A summary of the I-HTML will be inserted.}
(see 3, I-HTML)
{A summary of the I-HTML will be inserted.}
<LINK REL=Glossary>
<LINK REL=Dictionary>
<LINK REL=Translation>
{This will be extended.}
This is a tool for finding references to the search term in any language.
For example, if the string in the search is "table" it should also
find the Spanish document with the word "mesa" (table in Spanish).
Many EDI messages are printed.
As the EDI messages are very structured,
a translation of the message could be shown using
Pseudo-Automatic Translation (PAT).
To consult terminological databases easily,
it should be possible to pass a selected string
(with the mouse or otherwise) to CGI programs or similar.
This is a generic mechanism.
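A minimal sketch, in Python, of handing a selected string to a CGI program as a
query parameter; the host, program name and parameters are illustrative assumptions.

# Sketch: build a URL that passes a selected string to a CGI program.
from urllib.parse import quote

def terminology_url(selected_text, language):
    return ("http://example.org/cgi-bin/termbase?lang=%s&term=%s"
            % (language, quote(selected_text)))

print(terminology_url("mesa", "es"))
# http://example.org/cgi-bin/termbase?lang=es&term=mesa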
This is a first attempt and further work is needed.
The model is layered,
similar to the seven-layer ISO Reference Model
(physical, datalink, network, etc).
A different approach could be needed;
for example, a vector approach.
LayerNumber  LayerName        Example
1            compression      gzip
2            transformation   UTF-8
3            character set    Unicode (65, "LATIN CAPITAL LETTER A")
4            glyph            "A"
5            font             Times
Other items to put into the model:
- sorting order
- language (e.g., Korean)
There is a general tendency to overload the character set layer.
For example, wishing to allocate two code positions to the same ideogram
because it means different things in different languages.
How do objects negotiate when they speak different languages?
{This will be developed.}
{This will be developed.}
This section is included mostly to illustrate the kind of applications
for multilinguism.
Dragoman is a reference model for Language Engineering.
It uses the Multilingual Aligned Hypertext technique.
In essence, Dragoman describes a Database
(part structured and part documental) and
Services that can be implemented over the (multilingual) Database.
Often, different data structures are used for the Services described
below.
The Web paradigm is particularly well adapted to Dragoman.
The term Dragoman has nothing to do with dragons;
it means language interpreter.
What follows is a very brief description of some of the Services that
could be implemented over the Database.
There could be several programs offering the same Service.
Services processing whole documents could be implemented in batch;
particularly if they are using a very large Database (several gigabytes).
Selects the Multilingual Aligned Texts (MAT) that match a search criterion.
The search is fuzzy (e.g. an 87% match).
Unfound requests are valuable information that must be processed further.
The system must keep track of the unfound requests
to put in contact people with similar needs (matchmaker);
the user must decide what is a typing error and
what is a genuine unfound request.
Also the user can send messages to terminologists (demand driven terminology).
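A minimal sketch, in Python, of the fuzzy selection; the similarity ratio,
threshold and field names are illustrative assumptions rather than a prescribed
matching algorithm.

# Sketch: fuzzy selection of Multilingual Aligned Texts (e.g. an 87% match).
from difflib import SequenceMatcher

def search_mat(query, table, field="en", threshold=0.87):
    hits = []
    for record in table:
        ratio = SequenceMatcher(None, query.lower(), record[field].lower()).ratio()
        if ratio >= threshold:
            hits.append((ratio, record))
    hits.sort(key=lambda hit: hit[0], reverse=True)      # best matches first
    return hits

mat_table = [{"en": "The black table", "es": "La mesa negra"},
             {"en": "The green table", "es": "La mesa verde"}]
print(search_mat("the black tables", mat_table))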
The objective is to obtain a complete Translation Folder for a given document.
Hence, the translator should not need to consult dictionaries, databases,
glossaries, nomenclature lists, etc.
It is like having a hundred assistants preparing the text for the translator.
In a typical Translation Folder,
some paragraphs should be fully translated and
some paragraphs should be a mixture of
full sentences, segments, titles, terms, nomenclatures, etc
(all these items are packaged as Linguistic Objects);
background documents could also be taken into account.
The Linguistic Objects are marked with the Status;
for example, unverified, verified, compulsory, etc.
The search follows a fuzzy biggest chunk heuristic.
Traditionally there are two texts, source and target.
But there could be any number of language fields.
This could be the most useful Service for the translator and
it should be implemented early.
The translator could use the result on paper or on the screen.
Similar to the Translation Folder.
It should be adapted to an (existing) machine translation program that
follows up the processing.
For example, select only exact matches (no fuzzy) and
terms in the unfound phrases;
the machine translation program would translate only the unfound phrases.
A Machine Translation program that uses the Database directly.
For example, a program could combine perfect matches,
process the easy fuzzy matches such as dates, pure Machine Translation, etc.
Similar to the Translation Folder,
but where all the texts are found with a 100% match (no fuzzy search).
The program should be restricted to a collection of records;
i.e., it should not be allowed to roam the Database as
there could be bad surprises.
In particular, one must avoid word by word translation;
hence one must be very careful with small Multilingual Aligned Texts
(for example, a one-word Multilingual Aligned Text).
All the linguistic versions of a document are generated camera-ready.
There is no source and translation as such; the index is created and
the typesetting is (nearly) done.
This is the most useful Service for the Organization.
It is a very efficient way to produce documents.
The three phases Author-Translator-Publisher (ATP-chain) are highly integrated.
It is particularly adapted to periodic publications.
The production of standardized documents is trivial.
Documents in several linguistic versions are often required to be synchronized;
i.e., each page in each linguistic version must contain the same content and
the same layout (text, number of paragraphs, etc).
The typesetting, including the synchronization,
must be automated, and pages should not be processed one by one by a human;
a human operator should intervene only to fine-tune the publication.
TeX should be considered.
A document might need several representations;
for example, typeset for the Official Journal,
formatted for a CD-ROM
or marked in HTML (for CD-ROM or server).
First, a document in SGML should be generated;
indeed, the SGML document is the document.
All the following representations should be created from the SGML document.
This method should guarantee that all the representations have the same content.
With such a system in place,
the creation of secondary products is easy.
For example, a Parliamentary Commission could work with
a draft of the Budget typeset like the Official Journal,
in all the linguistic versions, enriched with hidden comments.
The user directs the program to a document similar to the one that
has to be translated.
The new pieces could be fetched from the Database.
This program could work without the Database,
though then the new pieces would not be fetched.
Similar translations could arise from a new version of a previous document
or from a new but similar document.
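A minimal sketch, in Python, of identifying the new pieces of a document relative
to a similar, already translated document; the use of a general-purpose difference
algorithm is an illustrative stand-in for the comparison engine.

# Sketch: mark the pieces of a new document that differ from a similar,
# already translated document; only these pieces need fresh attention.
from difflib import SequenceMatcher

def new_pieces(old_sentences, new_sentences):
    pieces = []
    matcher = SequenceMatcher(None, old_sentences, new_sentences)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "insert"):
            pieces.extend(new_sentences[j1:j2])
    return pieces

old = ["The white table.", "The black table.", "The green table."]
new = ["The white table.", "The blue table.", "The green table."]
print(new_pieces(old, new))     # ['The blue table.']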
Authors could use a technique similar to the Translation Folder and
Document Comparison.
The unknown parts of the text would be marked and
in certain cases alternatives would be proposed.
Texts created with the translation phase in mind are easier to translate.
Ideally, the author should aim to produce a text for translation with
Pseudo-Automatic Translation.
The objective is to verify the Consistency and Harmonization of the terminology.
The concepts are closely related and they can be combined,
but they are not the same.
- Consistency is naming the same object with the same term.
It is an internal characteristic of a set of documents
(the unitary set is allowed) and
it does not need a Database.
The more linguistic versions of the set of documents the better.
- Harmonization is imposing a term by the Terminological Authority.
It is an external characteristic of the document and
it needs a Database with the harmonized terms.
An editor that shows at least two (aligned) texts,
moves the texts in sync, highlights the differences, etc.
A program that prints one or several Multilingual Aligned Texts side by side.
It could be the following step after the Translation Folder.
Multilingual Aligned Texts (source and target) on paper allow the translator
to use traditional tools such as dictating.
This document makes heavy use of the documents cited in the text,
particularly the relevant RFCs and Internet-Drafts.
Also from the following:
- Web Multilinguism. BOF meeting, Third International WWW Conference
- Web Internationalization. BOF meeting, Fourth International WWW Conference
- Web Internationalization & Multilinguism. BOF meeting, Fifth International WWW Conference
- Internationalization Workshop. Fifth International WWW Conference
- WInter mailing list
- Informal talks/communications (probably the most fruitful)
The BOF meetings were organized by the author.
Martin Dürst made many suggestions to the position paper
of the author for the
Internationalization Workshop during the Fifth International WWW Conference.
The present document is over 80% based on the position paper.
He commented on the Reference Model and
I expect him to come back with further suggestions.
In such fluid circumstances, it is nearly impossible to attribute credit.
The following particularly come to mind:
Bert Bos
Martin Bryan
Martin Dürst
Albert Lunde
Larry Masinter
Gavin Nicol
Steven Pemberton
Christine Stark
François Yergeau
Faith Zack
The author tried to look for consensus and borrowed
heavily from many sources.
On the other hand, he is solely responsible
for any shortcomings and for the opinions expressed.
[BRYAN]
Martin Bryan,
"Using HyTime to Link Translations",
contribution to the WInter mailing list,
http://www.crpht.lu/~carrasco/winter/hytime.html
[CARRASCO-1]
M.T. Carrasco Benitez,
"On the multilingual normalization of the Web",
Poster for the Third International WWW Conference,
http://www.crpht.lu/~carrasco/winter/poster.html
[CARRASCO-2]
M.T. Carrasco Benitez,
"Web Internationalization",
Poster for the Fourth International WWW Conference,
http://www.crpht.lu/~carrasco/winter/inter.html
[CARRASCO-3]
M.T. Carrasco Benitez,
"WInter (Web Internationalization & Multilinguism0",
Position paper for the Internationalization Workshop during the
Fifth International WWW Conference,
http://www.crpht.lu/~carrasco/winter/popa.html
[CONNOLLY]
"Character Set Considered Harmful",
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html
[HTML 2.0]
T. Berners-Lee,
D. Connolly,
"HTML 2.0",
RFC 1866,
http://www.ics.uci.edu/pub/ietf/html/rfc1866.txt
[HTML 3.0]
"HTML 3.0",
expired Internet-Draft,
http://www.hpl.hp.co.uk/people/dsr/html3/CoverPage.html
[HTTP-1.1]
R.T. Fielding,
H. Frystyk Nielsen, and
T. Berners-Lee,
"Hypertext Transfer Protocol -- HTTP/1.1",
Work in progress
(draft-ietf-http-v11-spec-01.txt)
MIT/LCS, January 1996.
http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-v11-spec-01.html
[I-HTML]
F. Yergeau,
G. Nicol,
G. Adams,
M. Dürst,
"Internationalization of the Hypertext Markup Language",
Work in progress,
(draft-ietf-html-i18n-03.txt)
http://www.alis.com:8085/ietf/html/draft-ietf-html-i18n.txt
[ISO-8859-1]
ISO 8859-1:1987.
International Standard --
Information Processing --
8-bit Single-Byte Coded Graphic Character Sets --
Part 1: Latin Alphabet No. 1.
[NICOL]
G. T. Nicol,
"The Multilingual WWW"
http://www.ebt.com:8080/docs/multilingual-www.html
[UNICODE]
The Unicode Consortium,
"The Unicode Standard -- Worldwide Character Encoding -- Version 1.0",
Addison-Wesley, Volume 1, 1991, Volume 2, 1992.
http://www.unicode.org
[ZACK]
F. Zack,
"Serving Multilingual Online Documentation",
Poster for the Fifth International WWW Conference
{This list will be completed.}
Manuel Tomas CARRASCO BENITEZ
carrasco@innet.lu
http://www.crpht.lu/~carrasco/winter