SILICON VALLEY LOCALIZATION FORUM


Towards a Truly Worldwide Web

How XML and Unicode are making it easier to publish multilingual electronic documents.

by Stuart Culshaw

The complete article is available in MultiLingual Communications & Technology, Volume 9, Issue 3.

The Web was originally designed around the ISO 8859-1 character set, which supports only Western European languages. In the early days, when the development of the Web took place mainly in the US, this was not a problem. With the growth of Internet use worldwide, however, the number of people attempting to distribute non-English content over the Web has grown substantially. In addition, the ability to provide localized content has become an important source of competitive advantage for companies competing in the global marketplace. More robust standards and protocols to support multilingual publishing on the Internet have therefore become a priority.

The recent introduction of a number of new Web technologies and standards has gone some way to improving the situation, but this is more than just a character encoding or a font display problem. The Web is a whole new medium that goes far beyond the possibilities of traditional publishing. The frontier between document content and application user interface is increasingly blurred and documents are becoming applications in themselves. These "dynamic" documents contain a mixture of both document content and information about the content, or metadata. What is needed is a way to meet the needs of today's professional Web publishers and those of tomorrow's dynamic document application architectures.

This is where the Extensible Markup Language (XML) comes in. XML looks set to make large-scale hypertext document publishing to a worldwide audience a reality at last. At the same time it will make the life of the multilingual document publisher a whole lot easier.

Current Problems

HTML, the file format used for Web documents, inherited its reliance on the ISO 8859 character set definitions from the SGML standard on which it is based. ISO 8859 defines a dozen character sets that are useful for languages that use the Latin, Cyrillic, Arabic, Greek and Hebrew alphabets. However, these standards have a limited range of application: at eight bits per character, each set has room for only about 190 printable characters. Although they are sufficient for 10 or so of the most widely used languages, problems often occur when converting documents from one character set to another, because the same code is used to represent different characters in different character sets. Furthermore, ISO 8859 is totally inadequate for representing more complex writing systems, such as Japanese or Chinese, which contain many thousands of characters.
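The code collision described above is easy to reproduce. The following minimal Python sketch decodes one and the same byte under three ISO 8859 character sets; the choice of byte value (0xE9) and of the three sets is an illustrative assumption, not something taken from the article.

```python
# One byte value, interpreted under three different ISO 8859 character sets.
# The byte 0xE9 and the three sets chosen here are illustrative assumptions.
raw = bytes([0xE9])

charsets = ["iso8859_1", "iso8859_5", "iso8859_7"]  # Latin-1, Cyrillic, Greek
decoded = {name: raw.decode(name) for name in charsets}

for name, char in decoded.items():
    # Show the Unicode code point that each character set assigns to the byte.
    print(f"{name}: U+{ord(char):04X} {char}")
```

A document that does not declare which character set it uses therefore cannot be displayed reliably: the reader's software has to guess which of these characters the author meant.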

For publishers dealing with these more "exotic" languages, the only solution, until recently, was to rely on national language code standards. Andrew S. Tanenbaum once said, "the nice thing about standards is that there are so many to choose from." Nowhere is this more true than in the domain of national language code standards. There are literally hundreds of different codes available, each created over the years to satisfy particular constraints and constantly changing technological limitations. For example, there are over three dozen codes for the Arabic language alone. This overabundance of standards significantly complicates life for the international software developer and the multilingual publisher. But then, you already know that, right?

It was to resolve these problems that the Unicode standard was created. The work of the Unicode Consortium was subsequently combined with that of the ISO 10646 working group, and version 2.0 of Unicode/ISO 10646 was released in 1996. The Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing and display of the written texts of the diverse languages of the modern world.

Every character in Unicode is coded using two bytes (or 16 bits), which provides over 65,000 separate positions, 38,885 of which have already been defined. This is enough to represent most of the world's living languages, including those traditionally handled with single-byte codes, such as the Western European, Eastern European and bi-directional Middle Eastern languages, as well as those that require multibyte codes, such as Chinese, Japanese and Korean (CJK). And there's plenty of room left to encode the missing languages as soon as enough of the necessary research is done. Using Unicode, it is finally possible to display several languages within the same electronic document, even if they are based on different alphabets, without worrying about the problem of national language code tables.
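The arithmetic behind these figures, and the idea of mixing alphabets in a single text, can be sketched in a few lines of Python. The sample string is invented for illustration, and the sketch assumes UTF-16 (big-endian, no byte-order mark) as the concrete two-byte encoding, which the article does not name.

```python
# Number of positions available with 16 bits per character.
positions = 2 ** 16          # 65,536
defined = 38885              # figure quoted in the article
remaining = positions - defined

# A single string mixing Latin, Cyrillic, Japanese and Greek text.
# The sample text is an illustrative assumption.
text = "Hello, Привет, こんにちは, Γειά"

# Assumption: UTF-16 big-endian without a byte-order mark stands in for
# the two-byte coding described above.
encoded = text.encode("utf-16-be")

print(positions, remaining)
print(len(text), "characters ->", len(encoded), "bytes")
```

Every character in this sample occupies exactly two bytes, whichever alphabet it belongs to, and no national code table is needed to decode it again.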

Adding Unicode support to existing software applications has proved to be a major undertaking, often requiring a complete rewrite of low-level code. It is only recently that Unicode support has begun appearing in some key desktop applications. You can now find support in Windows NT 4.0, Java, HTML 4.0 and (yes, you guessed it) XML. This at last opens the way for truly multilingual Web-based applications and should accelerate the adoption of Unicode for other desktop applications.

XML: New Standard for a New Medium

Officially endorsed as a W3C Recommendation on February 10, 1998, Extensible Markup Language (XML) version 1.0 is a subset of ISO 8879:1986 - Standard Generalized Markup Language (SGML), the international standard for defining and using content-based markup of information. The SGML standard specifies how to define a set of markup codes (or "tags") to describe the content and structure of particular types of documents. This tag set, and the hierarchical relationships between each tag, are defined in a Document Type Definition (DTD). HTML is an example of an SGML DTD that was designed specifically for the creation of simple Web documents.

XML is essentially a simplified and modernized remake of SGML that removes many of the more complex and less-used features that made SGML somewhat difficult to implement. Unlike SGML, XML enables you to distribute documents without the DTD that was used to create them. This greatly simplifies the publishing procedure and makes it far easier to design tools that support XML. In addition, the designers of the XML standard were aware of the importance of internationalization issues. Accordingly, they specified Unicode as the fixed reference character set for XML documents, and in doing so went a long way toward solving the character encoding problem. XML also provides more robust hyperlinking features than HTML.
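As a minimal sketch of both points, the Python fragment below parses a small XML document that carries no DTD and contains text in several scripts. The tag names and sample text are hypothetical; `xml:lang` is the language attribute defined by the XML 1.0 Recommendation, and the parser is the one in Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are invented for illustration.
# No DTD accompanies it; the parser only requires it to be well-formed.
xml_text = """<greetings>
  <greeting xml:lang="en">Hello</greeting>
  <greeting xml:lang="fr">Bonjour à tous</greeting>
  <greeting xml:lang="ja">こんにちは</greeting>
  <greeting xml:lang="el">Γειά σου</greeting>
</greetings>"""

root = ET.fromstring(xml_text)

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Map each language code to its text; all four scripts share one document.
greetings = {g.get(XML_LANG): g.text for g in root.iter("greeting")}

for lang, text in greetings.items():
    print(lang, text)
```

Because the character set is fixed by the standard, the receiving application does not need to know in advance which languages the document contains.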

The term extensible describes the fact that XML enables you to define an unlimited number of document markup tags, adapted to different types of application. Of course, authors have been adding all sorts of custom tags, scripts and/or comments to their HTML documents for ages. This additional information is often referred to as metadata. The XML standard provides a more flexible encoding method, and represents a long-awaited alternative to the many incompatible proprietary extensions to HTML currently in use. The clear advantage of XML, then, is its capacity for handling arbitrary data structures, which opens the way for a powerful new breed of intelligent Web-based applications.

These data structures can be used to describe a document, with sections that contain rows, columns, cells and so on (just like in HTML). They may also be used to describe information to be interpreted by a piece of software (or to control a piece of machinery), or they may combine the two. XML provides a way to add this additional, machine-readable information to your documents and data in a way that is not only standardized, but that also separates the data from the format used to display it. Using a style sheet, you can specify which information should be displayed to the user, and how it should be formatted. Simply by applying a different style sheet, you can provide a different presentation of your data, without touching the content of the document.
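The separation of data from presentation can be illustrated with a small Python sketch. The two functions below are only stand-ins for style sheets (a real deployment would use a style sheet language rather than Python), and the `product` document, its tag names and its values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data document; tag names and values are illustrative.
product_xml = """<product>
  <name>Widget</name>
  <price currency="USD">19.95</price>
  <stock>42</stock>
</product>"""

def render_catalog(doc):
    """Stand-in for a customer-facing style sheet: name and price only."""
    price = doc.find("price")
    return f"{doc.findtext('name')}: {price.text} {price.get('currency')}"

def render_inventory(doc):
    """Stand-in for an internal style sheet: name and stock level only."""
    return f"{doc.findtext('name')} (in stock: {doc.findtext('stock')})"

doc = ET.fromstring(product_xml)

# Two presentations of the same, untouched content.
catalog_view = render_catalog(doc)
inventory_view = render_inventory(doc)
print(catalog_view)
print(inventory_view)
```

Each function decides which fields to show and how to format them, while the XML itself stays the same, which is the role the article assigns to style sheets.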

The Future of Multilingual Web Publishing

How is XML going to make things easier for you? Imagine that you have a collection of documents that describe a particular subject in a variety of languages, and that you want to publish these documents on your Web site. If your documents are marked up using XML, you can think of this collection as a kind of database, with each set of XML tags identifying a different "field" of data. The difference from a real database is that your data fields are organized into separate documents, rather than into rows and columns in a table. With an XML-aware search engine, you could perform complex searches on your document collection so as to retrieve, for example, all documents that contain your search text in their abstract, but only if the document is written in French or Japanese. From the results of your search, you could choose to generate a new document containing versions of the original text in each of the selected languages. You could then choose to hide the text in one language or, by applying a different style sheet, display the text in both languages side by side or paragraph by paragraph.

Thanks to HTTP/1.1 (the latest version of the Hypertext Transfer Protocol that is used for sending and receiving information over the Web), it is now possible for the appropriate language version of a document, if available, to be selected and delivered automatically, based on the user's preferences as reported by the browser. If a document search is unsuccessful, the server can send a list of alternative choices. Even if a search is successful, the server can send a list of related documents to tell the user about the existence of alternative versions.

One application that is of particular interest to language professionals is OpenTag. The OpenTag format is a markup format based on XML that can be used to encode text extracted from documents in various formats. Rather than converting information from "format X" into OpenTag format, OpenTag is designed so that data can be extracted from "format X," manipulated in an OpenTag environment, and later merged back into the "format X" file. As explained on the OpenTag website (www.opentag.org), if your translation memory databases are stored in OpenTag format, they can become tool and supplier independent. This makes it possible to share these assets among multiple translation service suppliers, who needn't be using the same suite of localization tools. A translation customer would have the enormous benefit of being able to export a document for translation from its native format (by selecting a "Prepare for Translation..." item from a menu, for example) into a file that is directly compatible with the localization process and translation tools to be used. The result of the translation could then easily be returned to the customer in the document's native format, or published directly in HTML, after a straightforward conversion process.
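The extract, manipulate and merge cycle can be illustrated with a deliberately simple Python sketch. It does not use the OpenTag markup, whose details the article does not give. The "native format" here is a hypothetical `key=value` resource file, and the `extract` and `merge` functions, the skeleton structure and the sample strings are all assumptions made for illustration.

```python
def extract(native_text):
    """Split a hypothetical key=value file into a skeleton (everything
    that is not translatable) and a list of translatable segments."""
    skeleton, segments = [], []
    for line in native_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            # Remember where the segment belongs, store the text separately.
            skeleton.append((key, len(segments)))
            segments.append(value)
        else:
            # Comments and other non-translatable lines stay in the skeleton.
            skeleton.append((line, None))
    return skeleton, segments

def merge(skeleton, segments):
    """Rebuild the native file from the skeleton and a list of segments."""
    lines = []
    for key, index in skeleton:
        lines.append(key if index is None else f"{key}={segments[index]}")
    return "\n".join(lines)

native = "# menu strings\nfile.open=Open\nfile.save=Save"
skeleton, segments = extract(native)

# Stand-in for the translation step, done with whatever tool is available.
glossary = {"Open": "Ouvrir", "Save": "Enregistrer"}
translated = [glossary[s] for s in segments]

result = merge(skeleton, translated)
print(result)
```

The translator only ever sees the list of segments, so the translation tool does not need to understand the native format. Merging the untranslated segments back reproduces the original file unchanged.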