SILICON VALLEY LOCALIZATION FORUM
by Stuart Culshaw
The complete article is available on MultiLingual Communications & Technology, Volume 9, Issue 3.
The Web was originally designed around the ISO 8859-1 character set, which supports only Western European languages. In the early days, when the Web was being developed mainly in the US, this was not a problem. With the worldwide growth of the Internet, however, the number of people attempting to distribute non-English content over the Web has grown substantially. In addition, the ability to provide localized content has become an important source of competitive advantage for companies competing in the global marketplace. More robust standards and protocols to support multilingual publishing on the Internet have therefore become essential.
The recent introduction of a number of new Web technologies and standards has gone some way to improving the situation, but this is more than just a character encoding or a font display problem. The Web is a whole new medium that goes far beyond the possibilities of traditional publishing. The frontier between document content and application user interface is increasingly blurred and documents are becoming applications in themselves. These "dynamic" documents contain a mixture of both document content and information about the content, or metadata. What is needed is a way to meet the needs of today's professional Web publishers and those of tomorrow's dynamic document application architectures.
This is where the Extensible Markup Language (XML) comes in. XML looks set to make large-scale hypertext document publishing to a worldwide audience a reality at last. At the same time it will make the life of the multilingual document publisher a whole lot easier.
HTML, the file format used for Web documents, inherited its reliance on the ISO 8859 character set definitions from the SGML standard on which it is based. ISO 8859 defines a dozen character sets that are useful for languages that use the Latin, Cyrillic, Arabic, Greek and Hebrew alphabets. However, these standards have a limited range of application: at eight bits per character, each set has room for only about 190 printable characters. Although they are sufficient for 10 or so of the most widely used languages, problems often occur when converting documents from one character set to another, because the same code represents different characters in different character sets. Furthermore, ISO 8859 is totally inadequate for representing more complex writing systems, such as Japanese or Chinese, which contain many thousands of characters.
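The code collision described above can be seen in a short Python sketch. It takes one byte value (0xE4, picked here only as an illustration) and decodes it under three ISO 8859 parts using Python's built-in codecs. The variable names are hypothetical, and the output uses only code points and character names so that it displays on any terminal.

```python
import unicodedata

# A single byte whose meaning depends entirely on the assumed character set.
raw = bytes([0xE4])

# Python's built-in codec names for ISO 8859 parts 1, 5 and 7.
charsets = {
    "ISO 8859-1 (Latin-1)": "iso8859_1",
    "ISO 8859-5 (Cyrillic)": "iso8859_5",
    "ISO 8859-7 (Greek)": "iso8859_7",
}

# Decode the same byte under each character set.
decoded = {label: raw.decode(codec) for label, codec in charsets.items()}

for label, ch in decoded.items():
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

A document saved under one of these sets and read back under another therefore turns into different letters, even though no byte has changed.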
For publishers dealing with these more "exotic" languages, the only solution, until recently, was to rely on national language code standards. Andrew S. Tanenbaum once said, "the nice thing about standards is that there are so many to choose from." Nowhere is this more true than in the domain of national language code standards. There are literally hundreds of different codes available, each created over the years to satisfy particular constraints and work around constantly changing technological limitations. For example, there are over three dozen codes for the Arabic language alone. This overabundance of standards significantly complicates life for the international software developer and the multilingual publisher. But then, you already know that, right?
It was to resolve these problems that the Unicode standard was created. The work of the Unicode Consortium was subsequently combined with that of the ISO 10646 working group, and version 2.0 of the Unicode Standard, aligned with ISO 10646, was released in 1996. The Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing and display of the written texts of the diverse languages of the modern world.
Every character in Unicode is coded using two bytes (16 bits), which provides 65,536 separate positions, 38,885 of which have already been defined. This is enough to represent most of the world's living languages, including those traditionally handled with single-byte encodings (Western European, Eastern European and bi-directional Middle Eastern languages) as well as those that require multibyte encodings, such as Chinese, Japanese and Korean (CJK). There is also plenty of room left to encode the missing languages as soon as the necessary research is done. Using Unicode, it is finally possible to display several languages within the same electronic document, even if they are based on different alphabets, without worrying about the problem of national language code tables.
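The arithmetic and the mixed-language claim above can be checked with a small Python sketch. It assumes the 16-bit model described in this article and uses Python's "utf-16-be" codec as a stand-in for that two-byte form. The sample string is an arbitrary illustration, written with escape sequences so the source stays plain ASCII.

```python
# Number of distinct positions available with 16 bits per character.
positions = 2 ** 16

# One string mixing five scripts: Latin "A", Latin a-umlaut (U+00E4),
# Cyrillic ef (U+0444), Greek delta (U+03B4) and Hiragana a (U+3042).
sample = "A\u00e4\u0444\u03b4\u3042"

# Encode as big-endian 16-bit units, without a byte-order mark.
encoded = sample.encode("utf-16-be")
bytes_per_char = len(encoded) // len(sample)

print(f"positions available: {positions}")
print(f"{len(sample)} characters -> {len(encoded)} bytes "
      f"({bytes_per_char} bytes per character)")

# The round trip recovers the original text; no code table has to be chosen.
restored = encoded.decode("utf-16-be")
```

All five characters sit side by side in one byte sequence and decode unambiguously. Characters outside the 16-bit range, which later versions of Unicode define, would need four bytes in this encoding, so the sketch deliberately stays within the range the article describes.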
Adding Unicode support to existing software applications has proved to be a major undertaking, often requiring a complete rewrite of low-level code. It is only recently that Unicode support has begun appearing in some key desktop applications. You can now find support in Windows NT 4.0, Java, HTML 4.0 and (yes, you guessed it) XML. This at last opens the way for truly multilingual Web-based applications and should accelerate the adoption of Unicode for other desktop applications.
Officially endorsed as a W3C Recommendation on February 10, 1998, Extensible Markup Language (XML) version 1.0 is a subset of ISO 8879:1986 - Standard Generalized Markup Language (SGML), the international standard for defining and using content-based markup of information. The SGML standard specifies how to define a set of markup codes (or "tags") to describe the content and structure of particular types of documents. This tag set, and the hierarchical relationships between each tag, are defined in a Document Type Definition (DTD). HTML is an example of an SGML DTD that was designed specifically for the creation of simple Web documents.
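As a rough illustration of content-based markup, the Python sketch below parses a small tagged document and prints its tag hierarchy. The tag names (article, title, section, heading, para) are hypothetical and do not come from any real DTD. The standard-library parser used here only checks that the tags nest properly; a DTD would additionally state which tags may appear inside which.

```python
import xml.etree.ElementTree as ET

# A small document marked up by content and structure (tag names are invented).
doc = """<article>
  <title>XML and Unicode</title>
  <section>
    <heading>Character sets</heading>
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
  </section>
</article>"""

root = ET.fromstring(doc)

def outline(element, depth=0):
    """Return the tag hierarchy as a list of lines, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(root)
print("\n".join(tree_lines))
```

The indentation in the printed outline mirrors the hierarchical relationships between tags that a DTD is meant to define.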
XML is essentially a simplified and modernized remake of SGML that removes many of the more complex and less-used features that made SGML somewhat difficult to implement. Unlike SGML, XML enables you to distribute documents without the DTD that was used to create them. This greatly simplifies the publishing procedure and makes it far easier to design tools that support XML. In addition, designers of the XML standard were aware of the importance of internationalization issues. Accordingly, they specified Unicode as the fixed reference character set for XML documents and in doing so went a long way to solving the character encoding problem. XML also provides more robust hyperlinking features than