


Web Internationalization & Multilingualism Symposium
Seville, 20-22 November 1996

This symposium, which was organized jointly by Sadiel S.A., on behalf of the European Commission, and the World Wide Web Consortium (W3C), was designed to promote the advancement of internationalization and multilingualism on the World Wide Web (WWW). The symposium was split into five half-day tracks:

  1. Social, Political and Cultural Aspects
  2. Basic Infrastructure
  3. Authoring
  4. Site Development
  5. Deployment

The symposium was opened by José Carlos Alarcón, Councillor for Work and Industry in the Andalusian Region responsible for Information and Telecommunications Technologies, who stressed the importance of language-specific IT in the development of the remoter regions of the European Union.

Social, Political and Cultural Aspects

The first session covered the social, political and cultural constraints on the development and use of the Internet. The following papers were presented:

Patrice Husson discussed the multilingual requirements of electronic commerce. Individuals or small and medium-sized enterprises (SMEs) need to be able to access the information superhighway using their own language through technology adapted to the cultural and administrative processes of the user's country.

Directorate General III (Industry) of the European Commission (EC DGIII) is leading the work of the G7 countries on the systematic aspects of linking SMEs, including staff training, how to identify partners, how to provide multilingual information, and the legal issues of using the Internet.

Anne Lehouck, from the DG III service supporting European standardization, pointed out that standardization is the role of industry. There needs to be a clear split between the roles of regulatory authorities and industry. Public authorities can help to set up suitable environments for industry players to get together, and can help to ensure that all parties are properly represented.

DGIII is working on an Electronic Commerce Action Plan that will provide interoperable building blocks that will allow existing tools to be used for electronic commerce. Electronic commerce is defined as anything concerned with "doing business electronically" and is much broader than existing EDI standardization.

Ana L. Valdes stressed that we need to remember that we are increasingly working in a multicultural world. A new class struggle is developing - a struggle for information. Countries and areas with poor communications are losing out on development possibilities. We need to look carefully at how to train minority groups to benefit from the Internet.

Iain Urquhart from Directorate General XIII (Telecommunications) gave an overview of the activities of the Commission's Language Engineering unit. Language Engineering was concerned with language in all its forms: text, speech and even aspects of image handling. Although the Web was mainly concerned, for now, with text and images, the importance of speech as a highly natural form of communication should not be under-estimated, especially in addressing problems of social exclusion.

To date R&D into machine translation has not proved very cost effective. Today more stress is being placed on developing agents for the automatic extraction of reusable terminology databases, and for indexing and retrieving documents based on terminology usage.

A Call for Proposals will be issued on December 17 with the aim of starting a further series of projects. Pilot applications should aim either to integrate well-mastered language engineering technology into business systems, or to develop more novel technologies. Major areas of interest are electronic commerce, Internet and WWW-based information and communication services, measures to make Europe's very diverse cultural heritage available on-line, and research in a number of areas such as intelligent language-aware agents and the exploitation of existing text bases. New entrants, especially from the new Member States, are particularly welcome.

Yvan Lauzan, the EDI coordinator for the Government of Quebec, provided a Canadian view of the benefits to be gained from adopting a multilingual approach to the use of electronic commerce on the Internet. The UN Trade Facilitation Process has already standardized some 200 forms for electronic commerce in a highly multilingual form, based on 9 languages, including Arabic, Russian (Cyrillic) and Chinese. The UN EDIFACT database provides a data dictionary for some 7000 pieces of business information. Canada has been developing the French version of this database for the UN.

Fatma Fekih-Ahmed of the Institut Regional des Sciences Informatiques et de Telecommunications (IRSIT) in Tunisia pointed out that Arabic countries are actively working on improving the interoperability of their X.25-based networks. Tunisia has been connected to the WWW since 1991, and has had a specialized agency for the WWW since 1995. Egypt and Morocco have had similar experiences. IRSIT is using the proposed multilingual HTML specification to make bilingual text available over the Internet.

Arabized tools exist but are not widely used as the character sets they are based on are often not interoperable. Typically images are used to ensure interoperability over the WWW.

A conference on Telematics and the Arab World will be held, under the auspices of UNESCO, in Tunis in the second quarter of 1997.

Marc Levilion, who is a member of AFNOR's CGTI/CN21 study group, highlighted his experiences as part of a joint EU and Canadian project on Telecommunications Requirements for International EDI (TRIEDI). The project requires CD-ROMs to be connected to network data, and specifically looked at the management of document translation and translation validation.

Automatic translation can highlight bad syntactical constructs in source documents. As documents may have ambiguities that could be translated in different ways it is vital that a human validator is used to correct automatic translations. The validator must, however, be provided with a terminology base that explains the correct vocabulary for the relevant domain to ensure that s/he does not automatically assume that the correct choice has been made during translation.

Basic Infrastructure

Larry Masinter from the Xerox Palo Alto Research Center took on the task of explaining those aspects of the WWW infrastructure that were created to provide services for internationalization and multilingualism.

The Internet provides for multiple mixed modes of communication, including those based on publish & retrieve (HTTP), send & receive (SMTP), broadcast & filter (NNTP) and realtime interaction (RTP) scenarios. Any data labelling mechanism for the Internet needs to be able to cope with all of these modes.

The HyperText Markup Language (HTML) is just one of the media types recognized by the Multipurpose Internet Mail Extensions (MIME). MIME was originally designed to allow non-textual data to be incorporated into e-mail transmitted using SMTP, but has since become a general-purpose mechanism for identifying the components of a multimedia message sent over the Internet using protocols such as HTTP. As well as identifying the type of data being transmitted, MIME contains optional fields for identifying the character set used to encode text and the language the content is written in.
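
As a rough sketch of the labelling described above (in Python; the header field names are the standard MIME/HTTP ones, the document values are invented examples), a French HTML page encoded in ISO 8859-1 might be announced as follows:

    # Minimal sketch: MIME/HTTP headers labelling media type, character set and language.
    # The values below are invented examples.
    headers = {
        "Content-Type": "text/html; charset=ISO-8859-1",  # media type plus character set
        "Content-Language": "fr",                          # language of the content
    }

    for name, value in headers.items():
        print(f"{name}: {value}")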

It is not possible to predefine which character set should be used for communication over the Internet. Originally the WWW was constrained to using the basic 7-bit ASCII Latin character set. This was later extended to the full 8-bit ISO 8859-1 Latin 1 character set, which covers the requirements of most American and Western European languages. For other languages ad hoc solutions were adopted. The development of ISO 10646 (also known as Unicode) has made it possible to provide a unified solution for future software developments, but there is still a need to support the other character sets already adopted by existing user communities.

The latest version of the HTTP specification (HTTP 1.1) allows users requesting data to specify what languages they are able to accept data in, and to qualify this by a factor that indicates the acceptability of this language to them. The protocol also allows servers to specify which alternative languages the data they have returned is available in.
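
A minimal sketch of how such a preference list with quality factors might be interpreted (the header value shown is an invented example):

    # Parse an Accept-Language value such as "fr, en;q=0.7, de;q=0.3"
    # into (language, quality) pairs, highest preference first.
    def parse_accept_language(value):
        prefs = []
        for part in value.split(","):
            pieces = part.strip().split(";")
            lang = pieces[0].strip()
            q = 1.0                              # default quality factor
            for param in pieces[1:]:
                key, _, val = param.strip().partition("=")
                if key == "q":
                    q = float(val)
            prefs.append((lang, q))
        return sorted(prefs, key=lambda p: p[1], reverse=True)

    print(parse_accept_language("fr, en;q=0.7, de;q=0.3"))
    # [('fr', 1.0), ('en', 0.7), ('de', 0.3)]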

In addition to identifying character set use in the HTTP/MIME header, details of the character set used to encode the document can be embedded into the metadata stored in the header of an HTML file. (Where there is more than one source for this information the information in the MIME header is used.) The language used for each element that makes up an HTML document can be indicated using a LANG attribute that contains an ISO 639 identifier for the relevant language. At present, however, there is no mechanism by which the Cascading Style Sheets (CSS) used to control the presentation of HTML files can use the value of the LANG attribute to control the way an element's contents are presented to users.
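
The precedence rule described above (the MIME header wins when both sources are present) can be sketched as a small resolution function; the fallback default shown is an assumption for illustration only:

    # Resolve the effective character set of a document from its two possible sources.
    def effective_charset(http_charset, meta_charset, default="ISO-8859-1"):
        if http_charset:       # information in the MIME/HTTP header takes priority
            return http_charset
        if meta_charset:       # otherwise use the charset declared in the HTML header metadata
            return meta_charset
        return default         # assumed fallback for illustration

    print(effective_charset(None, "UTF-8"))          # UTF-8
    print(effective_charset("ISO-8859-7", "UTF-8"))  # ISO-8859-7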

MIME headers need to be extended to allow information related to the Platform for Internet Content Selection (PICS) rating of data to be used to control content selection. At present, however, there is no mechanism by which headers can be signed, so they cannot safely be used to transmit secure data, and the PICS spec is based on the concept of a signed message confirming the suitability of the data.

Another area of concern is the extension of the existing mechanism for Uniform Resource Locators (URLs) to allow data to be identified in ways that are not specific to the machine it is located on, or which are not dependent on the use of Latin characters to identify a server. Two main concepts are being discussed. Uniform Resource Names (URNs) would allow data to be identified using a unique reference number, such as an ISBN, which would be language independent. Uniform Resource Characteristics (URCs) would use a number of different properties to identify a document. For example, it is proposed that there should be fields for identifying the genre of the document, its title, its author, its date and its publisher.

URLs need to be global in scope, parsable, transportable in different contexts, and extensible. As such they are not persistent. A persistent URL (PURL) would need to be held in a "library" that could redirect users to the current storage location of any document known to the library.

Martin Dürst from the University of Zurich discussed the problems associated with the internationalization of Uniform Resource Identifiers (URIs), which can be URLs, URNs or URCs. There is a need to distinguish between references to static data and to dynamically generated data sources. For static data, identifiers need to be both meaningful and easily transcribable. It has been claimed that everybody should be able to copy and keyboard every URL, which would mean that URLs had to be restricted to a subset of ASCII that provided the greatest common denominator (GCD) for existing code sets. The question that then arises is "Why should a URL for (for example) a Japanese document be in a form that a non-Japanese reader could enter?" URL entry should be a matter of relative effort for the majority of users of the document concerned, with a GCD option possibly being provided as a fallback for when keyboard entry of the standard URL is not possible. It should be noted, however, that machines connected to the web can always call up services to display an appropriate keyboard layout from which to select the characters required for a particular URL.
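
One fallback approach along these lines, encoding non-ASCII URL characters as UTF-8 octets escaped in %XX form so that an ASCII "greatest common denominator" spelling always exists, can be sketched as follows (the Japanese path is an invented example):

    from urllib.parse import quote, unquote

    # A non-ASCII path segment and its ASCII-safe escaped form.
    path = "/文書/索引.html"        # invented example path
    ascii_form = quote(path)        # UTF-8 octets escaped as %XX; '/' left intact
    print(ascii_form)               # /%E6%96%87%E6%9B%B8/%E7%B4%A2%E5%BC%95.html
    print(unquote(ascii_form))      # round-trips back to the original path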

For dynamically generated information the URL is typically used as a form of indexing. There is no need for such URLs to be human readable. The main advantage of using "human readable" URLs is that they allow users accustomed to English to guess how they can get to standard sets of data, such as the home page or FAQ list of a site.

URLs follow a deterministic model; the file is either where specified or it is no longer available. When directory services are used to identify files the identification criteria are likely to identify more than one file, from which users need to select alternatives.

If the current restrictions on the use of non-ASCII characters in URLs were removed, there would need to be a mechanism for identifying the character set to be used to return queries to the server as these are typically constrained to be the ASCII character set by the characteristics of the underlying database. The current mechanism, however, makes it difficult to define queries in languages based on non-Latin scripts.

Another area of concern is the contents of forms. Users should be able to complete forms in their own language. Menu options within forms should also be displayed in the local language and character set. Before entries can be returned to the server, however, they will need to be converted into a form suitable for the query processor. At present there is no mechanism for controlling such a conversion process, so queries need to be entered in the code set of the server, which is typically ASCII.

Numeric URNs can improve persistency by making names less meaningful. However, there needs to be a mechanism by which backwards compatibility can be maintained between URLs and URNs so that existing references to documents are not compromised by their being assigned a URN. URNs allow the letters A to F to be used in conjunction with digits so that hexadecimal number forms can be specified in a way compatible with the UTF-7 encoding scheme.

The UTF-8 encoding scheme provides a higher probability of being able to distinguish character data from random octets. This scheme is particularly effective with ASCII and Latin-1 characters. During 1997 support for UTF-8 will become available in most WWW browsers.
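
A small illustration of why UTF-8 works well with ASCII and Latin-1 text: ASCII characters keep their single-byte values, while other characters become multi-byte sequences (the sample words are arbitrary):

    # Show how many bytes UTF-8 needs for ASCII, Latin-1 and CJK text.
    for text in ("cafe", "café", "Zürich", "日本語"):
        data = text.encode("utf-8")
        octets = " ".join(f"{b:02x}" for b in data)
        print(f"{text!r}: {len(text)} characters -> {len(data)} bytes: {octets}")
    # ASCII characters stay one byte each; é and ü take two bytes; CJK characters take three.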

ITU is working on a scheme for transportable Universal Personal Numbers (UPN) that could be used as the basis for identifying people using URNs. How this could be used to automatically connect users to the appropriate file server for the specified document is currently unclear.

Erik van der Poel from Netscape covered the internationalization of LDAP, the lightweight version of the OSI Directory Access Protocol developed for use as part of the WWW. LDAP provides a subset of the OSI X.500 Directory protocol that can be used in conjunction with TCP/IP. An LDAP directory has a tree structure whose nodes have a number of associated attribute/value pairs. Each entry has a unique address known as a distinguished name. Additional attributes can be specified to identify the language of a name so that more than one language-dependent variant of the name can be specified. (This is particularly important to international corporations whose name changes from country to country.) Language-dependent descriptors can also be assigned to a file name to distinguish between different uses of the name.
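
The idea of language-dependent variants of a directory attribute can be sketched as below. The ";lang-" tagging convention shown follows the style later adopted for LDAP attribute options, and the entry values are invented:

    # A directory entry with language-dependent variants of the same attribute.
    entry = {
        "dn": "o=Example Corp, c=BE",      # invented distinguished name
        "o": "Example Corp",
        "o;lang-fr": "Exemple S.A.",        # French variant of the organisation name
        "o;lang-de": "Beispiel AG",         # German variant
    }

    def localized(entry, attr, lang):
        """Return the language-specific variant of attr if present, else the plain value."""
        return entry.get(f"{attr};lang-{lang}", entry.get(attr))

    print(localized(entry, "o", "fr"))   # Exemple S.A.
    print(localized(entry, "o", "es"))   # falls back to Example Corp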

IETF RFC 1738 contains a proposal for the use of User Friendly Names based on Westernized address formats such as Steve Kille, Computer Science, University College, London, GB. Internationalization of this proposal would, however, introduce problems as address formats in many parts of the world are radically different.

One problem with adopting any new methodology for naming data sources is how the names will be represented in the links stored within HTML documents, which at present require the use of a single reference to a uniquely referenced URL or to a URL located relative to the file containing the link. Until the linking mechanism is decoupled from the naming mechanism it is difficult to envisage how changes to file naming can take place globally.

Authoring

François Yergeau of ALIS Technologies, Montreal, introduced the efforts that have been made to internationalize HTML through the Internationalization ERB of W3C, the technical organizers of the symposium. An IETF RFC on the Internationalization of HTML is now awaiting assignment of a formal number by the RFC coordinator before being published. The new draft will extend the character set applicable for HTML 2.0 from the existing Latin 1 set to the full ISO 10646 character set, and provide attributes for identifying the language of a document, or of its individual elements, together with special elements for identifying embedded sections of text in a different language (<SPAN>) and text with a non-standard writing direction (<BDO>). The extensions include a new quote element (<Q>) that allows the way in which quoted text is displayed to be handled in a language-specific manner, and options for identifying text that has to be displayed superior to (<SUP>), or as a subscript to (<SUB>), the main text of a line.

Language identification is done using ISO 639 language codes, optionally qualified by ISO 3166 country codes to identify regional variations. Where these two options are insufficient an experimental extension can be defined by requesting IANA to assign an identifier for the language (e.g. x-cherokee).
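
A sketch of how such language tags are put together; the codes shown are tiny illustrative samples, not the full ISO registries:

    # Build a language tag from an ISO 639 language code, optionally qualified
    # by an ISO 3166 country code, or fall back to an experimental x- tag.
    def make_language_tag(language=None, country=None, experimental=None):
        if experimental:                  # e.g. "cherokee" -> "x-cherokee"
            return f"x-{experimental}"
        tag = language                    # e.g. "en", "pt"
        if country:
            tag += f"-{country}"          # e.g. "en-GB", "pt-BR"
        return tag

    print(make_language_tag("en", "GB"))               # en-GB
    print(make_language_tag("pt", "BR"))               # pt-BR
    print(make_language_tag(experimental="cherokee"))  # x-cherokee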

When creating multilingual documents it is possible to link to a document that uses a different character set from that used in the calling document. For this reason a charset attribute has been added to the anchor (<A>) element to allow users to clearly identify the character set of the file being linked to. The same attribute can also be used within the metadata of an HTML header to allow the character set of the document to be identified in a way that is independent of the HTTP protocol used to request it.

A number of further extensions have been proposed, including ones for handling Ruby annotations in Japanese and for providing locale-specific presentation of dates and currency values. It has also been suggested that hyphenation control should be provided to allow user control of the hyphenation of different languages.

Martin Dürst from the University of Zurich looked at what was missing from the glyph and character sets currently available in HTML. Where local fonts do not support the full ISO 10646 character set the new Bitstream public-domain Cyberbit font can be used to display the characters. If a character is not part of the ISO 10646 BMP set the user-assigned position for the character in the coded character set must be associated with a presentation glyph or an image that has been developed by the encoder of the document. At present there is no standardized mechanism for identifying and downloading such glyphs or images, though such conversion might be possible at the server prior to transmission of the document. An alternative would be to display the decimal (or hexadecimal) number assigned to unmapped code points, or a character (e.g. a large blob) that indicates that the glyph is not available.
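
The last fallback described, showing a code point number when no glyph is available, can be sketched as a simple rendering helper; the "available" set here is an invented stand-in for a real font's coverage:

    # Render each character, substituting a U+XXXX placeholder when the
    # (invented) glyph repertoire does not cover it.
    AVAILABLE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ")

    def render(text):
        out = []
        for ch in text:
            out.append(ch if ch in AVAILABLE else f"[U+{ord(ch):04X}]")
        return "".join(out)

    print(render("Internationalization of 文書"))
    # Internationalization of [U+6587][U+66F8]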

Chris Lilley, who is responsible for font-related matters at the W3C, stressed the need to avoid confusing bytes with characters and glyphs. Bytes are what is sent across the network when a file is interchanged using an encoding such as UTF-8. Characters are what make up the decoded document that is presented to an application by the file server. Glyphs are what is displayed on a screen when a character has been mapped to a specific font. The HTTP charset property specifies the byte encoding method, while the HTML charset attribute specifies the character set of the document. These are basically 1-to-1 mappings. By contrast, character-to-glyph maps are n-to-n. For example, ligatures represent multiple characters with a single glyph, while accented characters can be generated by combining two or more glyphs to represent a single character.
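
The byte/character/glyph distinction can be illustrated in a few lines; the ligature table is an invented fragment of a character-to-glyph mapping:

    # Bytes -> characters: decoding a UTF-8 byte stream received over the network.
    raw = b"na\xc3\xafve final"          # bytes as interchanged
    text = raw.decode("utf-8")           # characters: 'naïve final'

    # Characters -> glyphs: an n-to-n mapping, e.g. the pair "fi" drawn as one ligature glyph.
    LIGATURES = {"fi": "ﬁ"}              # invented fragment of a glyph table
    glyphs = text
    for pair, lig in LIGATURES.items():
        glyphs = glyphs.replace(pair, lig)

    print(text)    # naïve final
    print(glyphs)  # naïve ﬁnal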

Before you can search, hyphenate or transform a document you must convert it from its byte-encoded format to its character-encoded format. Before you can present a document to a user you must map the characters to glyphs. Cascading style sheets provide methods by which fallback fonts can be specified to allow you to identify which fonts to select glyphs from. For example, a rule such as font-family: Times, serif, Cyberbit will allow characters to be taken from a specified font (Times) if available, from a standardized font class (serif) if they do not occur in the specific font, or from the general-purpose ISO 10646 Cyberbit font if they are not part of the standard font repertoire.
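
The fallback behaviour described for the style-sheet font list can be sketched as a simple selection function; the coverage sets below are invented:

    # Pick a glyph source for each character by walking the fallback list
    # Times -> generic serif -> Cyberbit, in the spirit of the rule above.
    COVERAGE = {                          # invented coverage sets
        "Times":    set("ABCDEFabcdef "),
        "serif":    set("ABCDEFGHIJabcdefghij "),
        "Cyberbit": None,                 # None = covers everything (ISO 10646)
    }
    FALLBACK = ["Times", "serif", "Cyberbit"]

    def font_for(ch):
        for name in FALLBACK:
            chars = COVERAGE[name]
            if chars is None or ch in chars:
                return name
        return None

    for ch in "fig\u4e26":
        print(ch, "->", font_for(ch))
    # f -> Times, i -> serif, g -> serif, 並 -> Cyberbit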

The W3C font group hopes to develop a mechanism whereby specific fonts can be requested over the WWW in a way that protects the IPR of the developer (by charging users for their acquisition or use) while at the same time making new fonts easily available to users. This should include a mechanism for embedding fonts within documents in such a way that they cannot be reused in other situations or be extracted from a cache.

To ensure that documents can be formatted without having to receive all of the font data a clear distinction is made between font metadata and font drawing data. The metadata can be embedded within a document header to allow scalable fonts to be used to mimic glyphs until such time that the font drawing data, which is identified using a URL, can be downloaded.

The W3C group will base its work on HP's Panose system as this has been made available copyright free. Font embedding will be based on techniques similar to those employed in PDF. A mechanism for font negotiation using HTTP will be developed, as will mechanisms for requesting a subset of a font to keep the amount of information that needs to be transmitted to a minimum when large fonts, such as those used for ISO 10646, are requested. It is anticipated that WWW fonts will be usable from other environments, such as Java, VRML and CGM.

IPR protection will be based on digital signatures in conjunction with a machine-readable licence and some form of site binding. At present the use of micropayments for usage is not seen as being acceptable to users, but it is being considered as a possible method for protecting suppliers' rights. A special MIME type will be defined to allow fonts to be clearly identified when transmitted separately from a document.

A first draft of the fonts specification is currently being reviewed.

Gavin Nicol of Electronic Book Technologies in Rhode Island made a plea for the development of a global glyph repository that would allow users to add their own characters to the ISO 10646 set in a manner that would still allow any browser to access them. In particular he was concerned with glyphs used to indicate personal names within Taiwan, which are typically developed specifically for the name by its originator, and with the many characters not covered by the existing BMP specification of ISO 10646, which, for example, only includes a fraction of the 60,000+ possible Chinese characters.

The Text Encoding Initiative (TEI) Writing System Declaration (WSD) suggests how user definable subsets of character sets, which could include names for user defined characters as well as the standard ISO 10646 ones, could be defined. Named characters with uniquely qualified code set identifiers would provide a safer way of identifying characters than character position numbers.

Bert Bos from W3C gave a short overview of the features that are provided for internationalization within the first version of the Cascading Style Sheet proposal, such as the font fallback mechanism mentioned above. He invited attendees to take part in a discussion as to what needs to be added to the second version of the specification to cover the internationalization extensions to HTML on Friday afternoon. This meeting was attended by 6 people, who were able to identify most of the additional requirements.

Martin Bryan from The SGML Centre explained how the ISO 10179 Document Style Semantics and Specification Language (DSSSL) had been designed from the start with the problems of handling multilingual text in mind. He stressed the wide range of control attributes provided for identifying and controlling the glyph sets and writing direction selection mechanisms required for presenting multilingual text. He also stressed how formatting specifications were designed to work irrespective of writing direction, which can be defined as top-to-bottom as well as left-to-right and right-to-left. By adopting terms like start-indent and end-indent, which are independent of writing direction, it has been possible to make a single DSSSL specification for presenting HTML documents work for any language.

A subset of DSSSL designed specifically for the display of SGML documents online, DSSSL-O, has been specified as a starter set for those wishing to develop browsers that can display documents using DSSSL. While the DSSSL-O subset does not include many of the DSSSL options needed to handle multilingual text it could easily be extended to handle multilingual and non-Latin monolingual documents.

A DSSSL-O processor, called JADE, that is capable of converting HTML documents into RTF or other formats is now available in the public domain from http://www.jclark.com.

Martin Bryan also introduced the eXtensible Markup Language (XML) that was formally released in Boston on November 18th by the W3C SGML ERB. XML defines a subset of SGML that is ideally suited for the delivery of SGML documents over the WWW. By requiring that all tags be present, that attributes be presented in their fullest possible forms, and that empty elements be uniquely identifiable, XML allows browsers to be able to present documents they receive without having to fully parse the document type definition (DTD) and check the document instance against the DTD.

XML greatly simplifies the process of parsing incoming SGML documents, allowing browser developers to quickly develop sharable resources. XML documents can contain any element that the user cares to define in an SGML DTD. As such it will allow users currently complaining about the lack of extensibility in HTML to develop applications that are compatible with general-purpose document browsers.

XML is designed to be used in conjunction with DSSSL-O, and a formal specification of how this can be achieved will be developed during 1997, as soon as the ERB has defined a general-purpose linking method for XML. This linking mechanism will not only allow style sheets to be linked to XML documents but will also allow XML documents to be linked to other XML and HTML documents using general-purpose multi-headed links. It is anticipated that these links will be based on a subset of the mechanisms for locating and linking SGML objects defined in ISO/IEC 10744, the Hypermedia/Time-based Structuring Language (HyTime).

Martin Bryan also briefly introduced a new ISO project, approved at the beginning of November, to define Topic Navigation Maps for associating thesauri and other forms of terminological databases with SGML, XML and HTML encoded documents. Topic navigation maps will also be defined using the constructs provided in the HyTime standard. (Experts wishing to take part in the development of this new standard should contact mtbryan@sgml.u-net.com.)

Site Development

Conference chairman Tomas Carrasco Benitez from the European Commission started the session on site development by highlighting the distinction between internationalization (e.g. English OR Greek) and multilingualism (e.g. English AND Greek). Different mechanisms are required to deal with the two approaches. For internationalization to work you need to make users aware that other versions of the same document exist. For multilingualism you need a browser that can display all the languages involved at the same time.

In both cases you need to be able to manage the process of requesting and creating the translations, and to be able to move from text in one language to the equivalent piece of text in another language. For this some form of translator's workbench is required to create the linked language sets, together with browsers that are capable of switching from one language to another.

Quality of translation is also important. You need to be able to distinguish between machine translations, humanly validated/corrected machine translations and human translations. The HTTP accept-language option has been extended so that users can provide different acceptability factors for each type of translation. Mechanisms should also be provided to allow users to request a translation using either an automated translation facility or by selecting to pay for human translation. Users should be able to specify the timeframe in which the translation must be provided, and the maximum they are prepared to pay for a translation provided in that time.

Both machine and human translations can benefit from preprocessing of files against specialist terminological databases to ensure that technical phrases are translated in a domain-specific way. As such terminological databases tend to be too large to fit on a standard desktop, they should either be subsetted into collections usable at client sites on a specific set of documents, or be provided as part of server-based translation services accessible over the web.

It would also be helpful if translators could be warned of problems already encountered by other translators working on a particular source document, as such warnings could save a lot of time that is currently wasted in re-researching ambiguities in the source text. For this to be possible translators' workbenches must be designed with collaborative working in mind.

Existing translations are often poorly aligned at sentence and paragraph level. Better alignment is required if users are to be able to switch between different language versions in a way that indicates equivalence of text. If paragraph-level alignment is not maintained, unique identifiers need to be assigned to each paragraph to keep the versions in step; if it is maintained, alignment can be achieved simply by counting the number of text blocks from the start of the document.
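
The counting approach described above amounts to pairing paragraphs by position; a minimal sketch with invented paragraph lists:

    # Pair up paragraphs of two language versions by their position in the document.
    english = ["Introduction", "How to order", "Delivery terms"]
    french  = ["Introduction", "Comment commander", "Conditions de livraison"]

    for number, (en, fr) in enumerate(zip(english, french), start=1):
        print(f"paragraph {number}: {en!r} <-> {fr!r}")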

Searching multilingual or internationalized document sets requires the implementation of "concept searches" so that equivalent terms in other languages can be searched for at the same time. Concepts must be domain specific so that alternative possible translations can be identified.

Making machine translation services available through the Internet has some problems associated with it. Such services cannot be provided by Internet Service Providers as part of their mainstream operations because they take up too many CPU cycles. They must be passed to specialist servers. The question then arises as to who pays for such servers, and how. Costs should, ideally, be at a level that requires the use of a micropayment system.

Zeger Karssen of Alpnet presented the views of a user of a translator's workbench. They need to develop tools for job tracking and electronic billing to complement their translation service. They also need a mechanism whereby users can ask for a quick machine translation and then submit this and the source document to them for the generation of a human validated translation. They envisage that within 5 years up to 30% of their work will be generated via the Internet.

Iain Urquhart of EC DGXIII discussed the role of language engineering within electronic commerce. The preponderance of English on the Web, with some estimates ranging as high as 95%, appeared to be at odds with the huge investments in translation and localisation made in the software and localisation industries.

For electronic commerce the benefits of adopting a multilingual approach are easier to identify, but for trade to take place buyers must be able to find, understand and compare information, preferably in their own language. For this to be possible it is necessary to develop generic business information services.

With no improvement in the quality of machine translation in sight the role of translation bureaus becomes critical. It is important to integrate machine translation and domain-specific terminology sets with authoring tools to speed up translation services. There is also a need to adopt standardized methods for interchanging data in a form that preserves information about the character set, etc, used to create the data.

Searchable concept-based terminology resources and thesauri are needed. For example, searching for the word "jumpers" will not at present find you references to synonymous terms such as "pullovers" and "cardigans" without returning unwanted information about electronic components and horses. The development of electronic catalogues of related information would seem to be necessary if users are to be able to find the goods they want from a wide range of service providers. UN work on the Basic Semantic Repository could be highly relevant.
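
A toy sketch of the kind of domain-restricted synonym expansion described above; the thesaurus fragment is invented:

    # Expand a query term into its synonyms within one domain only,
    # so "jumper" finds knitwear rather than electronic components.
    THESAURUS = {                                             # invented fragment
        ("jumper", "clothing"):    ["pullover", "cardigan", "sweater"],
        ("jumper", "electronics"): ["jumper wire", "shunt"],
    }

    def expand(term, domain):
        return [term] + THESAURUS.get((term, domain), [])

    print(expand("jumper", "clothing"))
    # ['jumper', 'pullover', 'cardigan', 'sweater']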

Multilingual electronic commerce needs to be backed up by multilingual customer support. Increasing the language awareness of support staff, and the availability of on-line sources of multilingual help files and related documentation, must play a key role in setting up any multinational marketing operation. How this can be done by SMEs is at present not well understood.

Gregor Erbach of the German Research Center for Artificial Intelligence looked at cross-language document retrieval and terminology acquisition for multilingual websites. Human translation is too slow and costly to make it a sensible choice for maintaining multilingual websites, especially when compared with the low cost of starting up a single language site. When it comes to searching the WWW most users will not understand foreign languages, such as English, well enough to formulate an efficient search in it. Searches need to be based on concepts such as "fuzzy searching" or "find something similar to this". Ideally the latter should include "find me documents in these languages that cover the same subject area as this one". It is not, however, sensible to search documents in all languages for equivalent terms: the search must be restricted to those languages that the user has stated he is able to accept (and should probably be weighted in line with the weighting the user has assigned to that language).

Keyword matching does not work across domains or across languages. To be able to identify the equivalent word in another language you need to have a clear indication of the domain from which the word comes. It may be necessary to ask users to select between alternative meanings of a word before undertaking a multilingual equivalence search.

Recall and precision measurements can be used to determine which responses are most relevant to users. The accuracy of such measurements is improved if word stemming is applied to queries, but this technique does not work with highly inflected languages such as Georgian. Automatic query expansion to find synonyms can greatly improve hit rates. Limiting searches to specific locations or sets of files can greatly improve response relevance.
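
Precision and recall are straightforward to compute once the sets of retrieved and relevant documents are known; a minimal sketch with invented document identifiers:

    # Precision: fraction of retrieved documents that are relevant.
    # Recall:    fraction of relevant documents that were retrieved.
    retrieved = {"d1", "d2", "d3", "d5"}
    relevant  = {"d1", "d3", "d4", "d6"}

    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)    # 2 / 4 = 0.5
    recall    = len(hits) / len(relevant)     # 2 / 4 = 0.5
    print(precision, recall)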

Translation of queries is hard because of the ambiguity of words. Without user feedback the system is likely to select the wrong choice, or request options that are not relevant. Multilingual searches need to be concept based.

You need to be able to identify when more than one translation of a document has been identified during a multilingual search. In such cases only one response is needed: that for the language with the highest quality factor in the user's language acceptance specification.

Fatma Fekih-Ahmed of IRSIT discussed the role of machine translation in the Arabic speaking Mediterranean countries. The Arabic language is one of many unvoweled languages, which makes it more difficult to translate unambiguously. It also has a complex derivational morphology. Preparation of documents in conformance with ISO norms is important for documents for which machine translation is requested. Translators need access to bi-directional word processors as most translations will be to languages with a different writing direction.

It is important to take culture-specific notions into account when undertaking translations between culturally different linguistic areas. Translation of acronyms is a major problem as acronyms are typically domain specific, and can be ambiguous.

It is important to treat the document as a whole when translating, and not to translate small pieces or changes to the text out of context. This point is particularly true when you are translating multimedia data sets. It is important that text and sound are translated in a consistent manner, and that culturally-specific parts of images are checked/changed.

Translation of numbers is also a problem that has to be considered. For example, where Arabic is being used in India it may be necessary to convert Arabic numerals into Hindi ones. Unless you understand fully the user community the translation is aimed at there is a chance that the translation will be invalid for the use for which it was intended.

Another factor that needs to be considered is image position, and references to images that are based on position. For example, the English phrase "the picture on the right" may need to be translated into the equivalent of "the picture on the left" for use in a language that is read from right-to-left. Automatic translation of such phrases is, however, very difficult to achieve accurately.

Multimedia control scripts are another area where relative positioning is important. The position of windows may need to be redefined for an Arabic presentation, and a different GUI may need to be employed to allow users to cope with multilingual document sets where the principal language tends to change.

There is already an on-line English/Arabic dictionary that can be integrated with Word. Work is in progress at IRSIT to develop a trilingual French/English/Arabic dictionary.

Deployment

The final session was opened by Bert Bos of W3C with an overview of how far the latest proposals have been deployed to date. HTTP language and character set negotiation is now a common feature in server software, but browsers that can support it are only just being deployed. It will be early next year before there is widespread support for internationalized versions of HTML based on the ISO 10646 character set. Before this can take place, operating systems capable of supporting the character set must be installed. The latest versions of Windows will support the full Unicode character set, but it may be some time before a significant proportion of systems are upgraded to support this extended code set.

At present support for CSS is limited, but this is likely to change as soon as updates to existing browsers are released. Multilingual browsers are already available.

Before the new forms of URIs can be used, and fonts can be interchanged, it will be necessary to upgrade the interface between browsers and their caches, and to improve content negotiation over HTTP. The W3C Joint Electronic Payments Initiative (JEPI) is expected to develop specifications for negotiating payment methods over the next few months.

Until multilingual authoring systems are available, the creation of translation services over the web is likely to be slow. This symposium has identified many areas in which further work is required to develop a truly multilingual web service. One thing that is needed to promote a multilingual approach to the WWW is a list of sites that show how best to develop a useful service.

Chris Wendt, Programme Manager for Microsoft's Internet Platform and Tools Division, explained what Microsoft was doing to support internationalization on the web. During November Microsoft released a version of their FrontPage WWW editor that has Unicode and full UTF-8 support. All Microsoft fonts are now tagged with Unicode character identifiers.

Language negotiation needs to be extended to allow the software to take note of country qualification of language names. At present URLs are transmitted using the character set assigned to the local windows shell.

The soon-to-be-released Office 97 suite will provide across-the-range support for Unicode and UTF-8, and will provide facilities for outputting all files in HTML format. ActiveX and Java support are already able to make full use of the ISO 10646 BMP character set.

Mike Carrow of ALIS gave a demonstration of their Tango multilingual HTML document browser and its associated editor. Users can switch between the 19 supported languages on-the-fly in a way that changes the keyboard, the menus and the help files with a single command. It even changes the What's New button to point to a server containing data in the appropriate language!

Erik van der Poel of Netscape started his presentation by publicly thanking Microsoft for making the API for their wide character set support facilities publicly available. At present Netscape only supports Unicode in Windows 95 environments, but this will change next year.

Elisa Tormes of LISA gave a quick presentation on the role of the recently formed Special Working Group on the WWW of the Localization Industry Standards Association (LISA). Formed after a March 1996 forum in Prague, this group is dedicated to the development of shareable, non-proprietary specifications for the localization of software. To date it has been concentrating on surveying user needs, a report on which will be made available through the W3C website by the end of November. A set of simple guidelines for WWW publishers is expected to be published by the end of the year.

Gavin Nicol then presented the main features of the SGML-based DynaWeb server. He explained the advantages of storing data as fully coded SGML documents that were only converted into the form of HTML required by the browser when the document was requested.

Dirk van Gulik from the EC's ISPRA R&D centre explained how the public-domain Apache HTTP server had been enhanced to allow it to provide support for multilingual data sets. Apache provides facilities for converting data stored in Unicode to formats requested by users in the charset parameter of the HTTP header. Apache supports the use of the Q factor to describe language acceptability.
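
Server-side selection of the best available translation from the client's weighted preferences can be sketched as below, reusing the kind of (language, quality) list shown earlier; the available-variants list is an invented example:

    # Choose the best available language variant for a client preference list.
    def choose_variant(preferences, available):
        """preferences: iterable of (language, q) pairs; available: list of language codes."""
        best, best_q = None, 0.0
        for lang, q in preferences:
            if lang in available and q > best_q:
                best, best_q = lang, q
        return best

    prefs = [("es", 1.0), ("fr", 0.8), ("en", 0.5)]
    print(choose_variant(prefs, ["en", "fr"]))   # fr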

Closing Remarks

The symposium was summarised by Massimo Mauro of EUCOMCIL, who stressed the changes that had taken place in the last 2-3 years, from an environment where multilingualism meant providing data in perhaps one or two other European languages to one where any language was considered relevant and possible.

The closing speech of the symposium was given by José Antonio Sainz-Pardo, Vice Minister for Local Government in Andalusia, who highlighted the role the European Commission was playing in making Europe a key part of the Information Society.

Martin Bryan



File created: November 1996

©ECSC-EC-EAEC, Brussels-Luxembourg, 1996
