[Cache from http://www.xerox-emea.com/globaldesign/paper/dtds/dtd-paper.htm; please use this canonical URL/source if possible.]


Localisation Considerations in DTD Design

Richard Ishida
Xerox Global Services

If you are creating XML (or SGML) documents that will be translated, there are features you should build into your Document Type Definition (DTD) so that localisation goes smoothly and efficiently. This paper looks at some of the key issues.

The paper refers to standard topics such as character encoding and language declarations, but also covers topics such as implementation of emphasis and style conventions, handling of citations, use of text in attribute values, and the need for an element like HTML's SPAN. In addition, other topics that have traditionally been associated with translation of user interface messages become applicable due to the nature of XML documents. These include the provision of designer's notes, identification of non-translatable text, and use of element ids for automatic translation of elements.

The paper assumes some familiarity with XML and DTD concepts, but remains at the conceptual level, rather than proposing specific DTD constructs.

Introduction

The DTD (Document Type Definition) describes the structure of an XML document, and our focus here will be on issues that someone should bear in mind with regard to the translatability of the document structure they are defining. We will not focus on issues relating to the implementation of content or presentation in this paper.

The paper is not intended to be a definitive or authoritative guide. The intention is to raise some issues and propose possible solutions in an attempt to widen the debate.

The ideas arose from attempts at Xerox to define a DTD to support document creation, document re-purposing and content re-use. Given the space available, thoughts presented here aim to introduce the reader to the key concepts rather than to analyse potential solutions in great depth.

The information is presented as a series of problem statements, followed by suggestions for DTD implementation. The word 'suggestion' has been chosen to convey the spirit of the paper - a foundation for further discussion.

Although the activity at Xerox dealt with SGML documents, this paper will focus specifically on XML applications. Note, however, that some of the suggestions have implications for process and stylesheet implementation.

The comments in this paper should not be taken as Xerox policy or practice. They are the thoughts of the writer.

Allow for identification of non-translatable text

Problem description

There are a number of reasons why it may be necessary to leave particular words, phrases or parts of a document in English during translation.

Example 1: The text in the translated documentation refers to a user interface or part of a user interface that will not be translated (eg. 'Click on the START button.' where START refers to a command on an untranslated user interface).

Example 2: The text provides an example of command syntax the user can type, such as 'DateQuery( Year, Month )'. In this case the translator must be made aware that 'DateQuery' is command syntax and must remain in English, whereas the words 'Year' and 'Month' should be translated since these are prompts for text that the user will enter in their own language.

If text must not be translated for one reason or another it must be possible to indicate this in the XML. The mechanism for this should be defined in the DTD.

Without this, translators may at best waste time deciding what should and should not be translated and at worst will make mistakes.
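As a minimal sketch of what such a mechanism could look like, the following marks up Example 2. The element names syntax and placeholder, and the translate attribute with its yes/no values, are assumptions made for illustration and are not constructs prescribed by this paper.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!-- Hypothetical names: 'syntax', 'placeholder' and 'translate' are illustrative only -->
  <!ELEMENT para (#PCDATA | syntax)*>
  <!ELEMENT syntax (#PCDATA | placeholder)*>
  <!-- Command syntax defaults to non-translatable -->
  <!ATTLIST syntax translate (yes | no) "no">
  <!ELEMENT placeholder (#PCDATA)>
  <!-- Prompts for user-supplied text default to translatable -->
  <!ATTLIST placeholder translate (yes | no) "yes">
]>
<para><syntax>DateQuery( <placeholder>Year</placeholder>, <placeholder>Month</placeholder> )</syntax></para>
```

With defaults declared in the DTD, the author does not have to set the flag by hand each time, and a translation tool that reads the DTD can protect 'DateQuery' while exposing 'Year' and 'Month'.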

Suggestions

Provide a SPAN-like element

Problem description

It may be necessary to apply a translate or other attribute (eg. a language indicator) to a range of text which is not bounded by other elements.
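A sketch of such an element, applied to Example 1 from the previous section, might look like the following. The element name span and the translate attribute are assumptions borrowed from HTML usage for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA | span)*>
  <!-- Hypothetical generic inline element with no meaning of its own;
       it exists only to carry attributes over an arbitrary range of text -->
  <!ELEMENT span (#PCDATA | span)*>
  <!ATTLIST span
    translate (yes | no) #IMPLIED
    xml:lang  CDATA      #IMPLIED>
]>
<para>Click on the <span translate="no">START</span> button.</para>
```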

Suggestions

Allow for developers to send notes to translators

Problem description

To assist the translator to achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to:

This can help translators avoid mistakes or provide needed information quickly at the point of need.

Two types of developer’s note are needed:

An 'alert' contains information that the translator must read before translating a piece of text. The translation environment should bring this type of note to the attention of the translator before they begin to translate. (For example, an instruction to the translator to leave parts of the text in the original language if a translate attribute is not available.)

A 'description' provides useful background information that the translator will refer to only if they wish. (For example, a clarification of ambiguity in the original text).

Suggestions

Implementation notes

It was felt that this need not, and should not, be an element for reasons including the following:

The only likely issue that can be foreseen with this approach is that the developer’s note may be quite long. This was however discounted as an issue, since such long notes are unlikely to be appropriate in any case.
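Since the notes above indicate that the developer's note was not modelled as an element, one possible reading is that it is carried in attributes. The following is a sketch under that assumption; the attribute names note and note-type are hypothetical and do not come from the original suggestions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA)>
  <!-- Hypothetical attributes: 'note' holds the text of the developer's note,
       'note-type' distinguishes an alert from a description -->
  <!ATTLIST para
    note      CDATA                 #IMPLIED
    note-type (alert | description) "description">
]>
<para note-type="alert"
      note="Leave the word START in English: it names a button on an untranslated user interface.">Click on the START button.</para>
```

A translation environment could then show notes of type alert before the translator starts work on the element, and offer notes of type description only on request.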

Don't use translatable text in attribute values

Problem description

If translatable text is provided as an attribute value rather than element content, the following problems may arise:

The following example is based on an element that associates terms for use in the index with a location in a document. If translatability issues are not borne in mind, you may end up with something such as:

<index-entry id="123" term="iroha: poem for" />

Here 'iroha' and 'poem for' are translatable text indicating an index entry and a sub-entry, but because they are attribute values they could not have non-translatable flags or designer’s notes attached if such were appropriate.

This also applies to any elements that represent text equivalent to the alt and title attributes in HTML 4.0.
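By way of contrast, a sketch of the same index entry with the translatable text moved into element content could look like this. The child element names term and sub-term and the translate attribute are assumptions for illustration, and the flag on 'iroha' is set only to show that a flag can now be attached.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE index-entry [
  <!-- Hypothetical structure: translatable text lives in element content -->
  <!ELEMENT index-entry (term, sub-term?)>
  <!-- An ID value must be an XML name, hence 'ix123' rather than '123' -->
  <!ATTLIST index-entry id ID #REQUIRED>
  <!ELEMENT term (#PCDATA)>
  <!ATTLIST term translate (yes | no) "yes">
  <!ELEMENT sub-term (#PCDATA)>
  <!ATTLIST sub-term translate (yes | no) "yes">
]>
<index-entry id="ix123">
  <term translate="no">iroha</term>
  <sub-term>poem for</sub-term>
</index-entry>
```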

Suggestions

Implementation notes

This drives towards a WYSIWYN (What You See Is What You Need) rather than a WYSIWYG (What You See Is What You Get) authoring environment - or one that allows for the conditional hiding or visibility of certain elements, since not all elements are part of the visible document.

Make emphasis & style conventions logical, not presentational

Problem description

Emphasis and style formatting will be applied in different ways in different cultures and writing systems. For example:

The stylesheet should provide a method of applying different presentational mechanisms to a range of text on a language by language basis. This however needs to be facilitated by the DTD design.

It is dangerous to invent elements or attributes in the DTD which are associated with presentational, rather than logical, attributes. Allowing the author to choose bold or italic tags for emphasis and style elements can lead to:

As an example of the second point, take a hypothetical document written in Japanese for on-screen viewing. Because bold and italic don't work well on screen for complex Japanese characters at small point sizes, the Japanese DTD designer expects all highlighting to be rendered as underlined text. Therefore, the DTD provides only one element for highlighting both new terms and emphasis. When that document is translated into English (using a set of characters that do allow for reasonable rendering of bold and italic text) it may be desirable to render the new terms as bold text but the emphasis as italic text. However, because the Japanese designer only implemented a single element, this is no longer achievable without changing the DTD and manually editing the XML. The fact that the DTD designer was thinking in terms of presentation restricted the possibilities where more choices or preferences actually existed.
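A sketch of the alternative, in which the DTD distinguishes the two logical roles and leaves the rendering to the stylesheet, might look like the following. The element names emph and new-term are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!-- Hypothetical logical elements: each names a purpose, not a rendering -->
  <!ELEMENT para (#PCDATA | emph | new-term)*>
  <!ELEMENT emph (#PCDATA)>
  <!ELEMENT new-term (#PCDATA)>
]>
<para>A <new-term>stylesheet</new-term> decides how text is rendered; the author should <emph>never</emph> choose bold or italic directly.</para>
```

A Japanese stylesheet could render both elements as underlined text, while an English stylesheet could render new-term as bold and emph as italic, with no change to the DTD or the document.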

Suggestions

Implementation notes

This approach will have the additional benefit of providing more discipline and consistency in the way authors use and apply emphasis and style formatting.

Note that the success of this approach relies on the ability to state the language of a particular piece of text. This will be addressed in the section entitled "Use language declarations at the top of the document and for any element or range of text in a language other than the document language".

Identify and clearly describe tags which translators may need to change

Problem description

The recommendations for this section touch on process issues as well as DTD design.

Certain element tags, such as emphasis, will need to be manipulated by the translator. Because the syntax of languages varies, emphasis-related highlighting will stretch across a different range of characters from language to language, and may need to be removed or doubled up by the translator in some cases. For example, verbs typically appear in the middle of English sentences, at the end of Japanese sentences, and at the beginning of Arabic sentences.

It is expected (but not necessarily always true) that many such elements which contain content will be in-line in nature rather than block elements, since the key driver for their adaptation will be to achieve a natural sentence flow in the foreign language. We will refer to such elements as phrase elements. Further, many such tags are likely to be hooks for the application of formatting style. Examples may include emphasis, hyperlink, subscript, superscript, citation and span.

Empty tags also fall into this category. In order to position an empty tag appropriately in the text the translator will usually need to know what it refers to.

Much of the time the translator will simply need to know the meaning of the tags so that they can fit the translation around the protected tag names in the appropriate way. In many cases, this will also mean re-ordering multiple tags.

There will, however, be occasions when the translator wishes to remove or insert new tags because of the requirements of their language. For example, a particular type of emphasis may be expressed by the language rather than by formatting changes and the tag can be removed. Alternatively, some languages may require two ranges of words to be emphasised in the translation where there is only one range in the original.

Suggestions

Provide truly unique element identifiers to enable change analysis

Problem description

In order to most effectively re-use translated text where elements are re-used (either across update versions or across deliverables) it is necessary to have a totally unique and eternally persistent id associated with an item of content. This id allows the translation tool to correctly associate original and translated text units with each other prior to examination for changes. This can be referred to as 'change analysis'.

The potential for re-use of translations is very appealing in terms of productivity and cost savings for product launch.

Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching (commonly referred to as 'translation memory') techniques, which simply look for similar text in the database without being able to tell whether the context of its use is the same.

This change analysis technique has been possible with UI messages in the past, but the introduction of element-based XML documents will allow for its use in documents also. Note however that where text entities will be re-used across products, these ids must be totally unique.

Suggestions

Implementation notes

If you have a structured authoring environment that automatically provides ids for elements, make sure that those ids are unique across all documents, not merely within a given document. Otherwise you will have problems when you chunk up text for content re-use.
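The following sketch shows one way an id might be declared and populated. The id scheme shown (a reversed domain name, a document key and a counter) is purely hypothetical. Note that the ID attribute type only makes a validating parser check uniqueness within a single document; uniqueness across documents has to be guaranteed by whatever process allocates the values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section [
  <!ELEMENT section (para)+>
  <!ELEMENT para (#PCDATA)>
  <!-- ID values must be XML names, so they cannot begin with a digit -->
  <!ATTLIST para id ID #REQUIRED>
]>
<section>
  <!-- Hypothetical id scheme, allocated once and never re-used or changed -->
  <para id="org.example.userguide.p000123">Selecting Initialise Auditron will always produce a confirmation screen.</para>
  <para id="org.example.userguide.p000124">If OK is selected twice, the Auditron will be initialised and the account data deleted.</para>
</section>
```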

Use character encoding declarations

Problem description

The encoding of the text in an entity must be declared or predictable in order for the application reading it to understand the contents of that entity – whether this is a document entity or an external parsed entity.

An encoding used to create the original text might not support the characters required for translation. If the translated text is in a different encoding from the original, it is especially important to declare which encodings are being used where. In addition, if different encodings are being used for external entities, some conversion will be necessary to produce a single encoding throughout the document instance.

If the application cannot automatically apply the correct character set mappings the document will not be readable.

It would be useful to always declare the encoding used at the top of any file, in ASCII, to aid in reading or processing files.

Implementation notes

XML allows for an encoding declaration at the very beginning of any document entity or external entity using the following text declaration:

<?xml version="1.0" encoding="encoding_name" ?>

The attribute value represented here by 'encoding_name' should be a name from the IANA 'charset' registry (see http://www.iana.org/assignments/character-sets).

All XML processors must recognise UTF-8 and UTF-16 encodings for the Unicode character set, and most will also handle ASCII (7-bit) and ISO 8859-1 encodings. Entities encoded using UTF-16 must begin with a byte order mark (#xFEFF) (see Appendix B of the Unicode Standard). The processor or application must be able to deal with this byte order mark. Entities which do not use a UTF-8 or UTF-16 encoding are required to include the text declaration with an encoding declaration. This text declaration must be in an encoding equivalent to ASCII or UTF-16, so that the processor can read it.
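As an illustration of mixed encodings, the sketch below shows a UTF-8 document entity that pulls in an external parsed entity stored in a different encoding. The file name chapter1.xml and the element names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE manual [
  <!ELEMENT manual (chapter)+>
  <!ELEMENT chapter (#PCDATA)>
  <!-- chapter1.xml is a hypothetical external parsed entity stored in ISO-8859-1.
       Because it is neither UTF-8 nor UTF-16, it has to begin with its own
       text declaration:
       <?xml version="1.0" encoding="ISO-8859-1"?>
       followed by its content, for example <chapter>...</chapter> -->
  <!ENTITY chapter1 SYSTEM "chapter1.xml">
]>
<manual>&chapter1;</manual>
```

Each entity declares its own encoding, so the processor can decode the two files independently before combining them.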

Suggestions

Use language declarations at the top of the document and for any element or range of text in a language other than the document language

Problem description

A number of stylesheet settings will vary according to the locale of the text – ie. the language and market region. Examples include text expansion, hyphenation, wrapping rules, colour usage, fonts, spell checking, line height and inter-line spacing, quotation marks and other punctuation, etc. For the appropriate presentation to be applied automatically to documents in different languages it is therefore essential to know the language of the text.

It would also be useful to indicate the locale of the document as a whole to facilitate both processing and identification of translated documents during localisation and content management.

It should also be possible to indicate the language of the XML content for any element or range of text where the language differs from that of the document as a whole. Note that this includes graphics, audio and other unparsed entities, which may need labelling for or treatment specific to a given locale.

Implementation notes

XML applies this principle through the use of the reserved attribute xml:lang. (Note that this attribute must be declared in the DTD before use.)

The XML standard describes the permissible values of xml:lang, which start with a language identifier followed by zero or more subcodes separated by hyphens and indicating regions or dialects.

Note that calling this a 'language' attribute can sometimes be a slight misnomer. In a localisation scenario this attribute value may be used to relate information to a particular market region or 'locale' (for example Canadian French fr-CA vs. European French fr-FR language variants, or the data format appropriate for a country with more than one language such as Switzerland).

The attribute value should take the form described by the XML standard, with the language tags themselves as defined by RFC 3066.
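A minimal sketch of xml:lang declared in a DTD and used both for the document as a whole and for a range of text in another language follows. The element names doc, para and span are assumptions for illustration; xml:lang itself is the reserved attribute from the XML standard.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc [
  <!ELEMENT doc (para)+>
  <!-- The document language is required on the root element -->
  <!ATTLIST doc xml:lang CDATA #REQUIRED>
  <!ELEMENT para (#PCDATA | span)*>
  <!ATTLIST para xml:lang CDATA #IMPLIED>
  <!ELEMENT span (#PCDATA)>
  <!ATTLIST span xml:lang CDATA #IMPLIED>
]>
<doc xml:lang="en-GB">
  <para>The French word for cat is <span xml:lang="fr">chat</span>.</para>
</doc>
```

The value set on doc applies to all of its content unless overridden, so only the French word needs its own declaration.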

Suggestions

Establish a method for dealing with multilingual documents

Problem description

It may be necessary to indicate that a particular document is written in more than one language. For example, a tri-fold or a quick-start manual may contain parallel text written in French, German and Italian for the Swiss market. In this case it may be necessary to enable the language declaration for the document as a whole to refer to a number of language or locale identifiers at the same time, since there is no one overall language associated with this document.

Note that the xml:lang attribute itself does not allow for multiple values to be specifically named in a single declaration.
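One possible approach, offered only as a sketch and not necessarily the one the author had in mind, is to list the locales in a separate document-level attribute and to label each parallel block with xml:lang. The languages attribute and the element names below are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE quickstart [
  <!ELEMENT quickstart (step)+>
  <!-- 'languages' is a hypothetical attribute listing every locale in the document -->
  <!ATTLIST quickstart languages NMTOKENS #REQUIRED>
  <!ELEMENT step (text)+>
  <!ELEMENT text (#PCDATA)>
  <!-- Each parallel text block states its own language -->
  <!ATTLIST text xml:lang CDATA #REQUIRED>
]>
<quickstart languages="fr-CH de-CH it-CH">
  <step>
    <text xml:lang="fr-CH">Appuyez sur OK.</text>
    <text xml:lang="de-CH">Drücken Sie OK.</text>
    <text xml:lang="it-CH">Premere OK.</text>
  </step>
</quickstart>
```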

Suggestions

Alternative suggestions

Delimit quotations from other sources

Problem description

Take an example such as the following:

Selecting Initialise Auditron will always produce a confirmation screen. If OK is selected twice, the Auditron will be initialised and the account data deleted.

The text 'Initialise Auditron' is a quotation from the user interface. Since these messages have typically already been translated, the translator's job is to locate the actual translation previously supplied so as to maintain consistency. The same principle applies to quotations from whatever source. If they already exist in translation, the translator's task is to find and re-use the original phrasing from the previous translation activity.

For example, 'Initialise Auditron' may have been translated in Spanish in any of the following ways:

All of these are acceptable translations, but the translator must try to choose the exact words used in the context which is being quoted.

Similarly, the text 'OK' could be translated in any of the following ways:

This time there is a different reason to choose the appropriate translation since the standard is different on Windows, Macintosh and Unix.

Note also that quotation marks are best added by the stylesheet rather than the author. The actual punctuation marks used for quotations and quotations within quotations vary considerably from language to language, even across languages based on the Latin script. Applying the quotation marks via the stylesheet guarantees consistent usage and placement of quotation marks, and allows for quotation marks to be easily localised or left out altogether in repurposed information (eg. in a list of quotations derived from the text using XSLT).

The previous paragraph reinforces the idea that the author should be concerned with content, not presentation, by treating the quotation mark as presentational. Note however that there could be instances where such an approach is not valid. For example, you might be quoting a passage that itself contains quotation marks and want to ensure that the text is preserved character-for-character as in the original. In such a case it would be valid for the author to enter characters rather than markup.

Suggestions

The following shows how the example at the beginning of this section could be coded:

<para>
Selecting <quote src="UI:auditron initialisation">Initialise Auditron</quote> will always produce a confirmation screen. If <quote src="UI:auditron initialisation">OK</quote> is selected twice, the Auditron will be initialised and the account data deleted.
</para>
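The DTD declarations behind such markup could be as simple as the following sketch. The content model and attribute type are assumptions; only the names quote and src come from the example above.

```xml
<!-- Hypothetical declarations for the quote element used in the example -->
<!ELEMENT para (#PCDATA | quote)*>
<!ELEMENT quote (#PCDATA)>
<!-- 'src' identifies where the quoted text comes from, so that the
     translator (or a tool) can look up the existing translation -->
<!ATTLIST quote src CDATA #IMPLIED>
```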

Include pointers to UI messages

Problem description

This takes the idea in the previous section a little further. If a UI message is quoted in the documentation, it is likely to improve productivity and quality of localisation to pull the translated text directly from the UI database, rather than asking the translator to type it in again.

In this case, the example given immediately above might look more like the following:

<para>
Selecting <ui-message name="Initialise Auditron" id="msg123" /> will always produce a confirmation screen. If <ui-message name="OK" id="msg124" /> is selected twice, the Auditron will be initialised and the account data deleted.
</para>
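A sketch of declarations for such an empty element follows. The attribute types are assumptions. The id is declared as NMTOKEN rather than ID here because it is taken to be a key into the UI message database, and the same message may be cited more than once in a document, which the ID type would not allow.

```xml
<!-- Hypothetical declarations for the ui-message element used in the example -->
<!ELEMENT para (#PCDATA | ui-message)*>
<!ELEMENT ui-message EMPTY>
<!ATTLIST ui-message
  id   NMTOKEN #REQUIRED
  name CDATA   #IMPLIED>
<!-- 'id' is the key used to fetch the message text for the target language;
     'name' is a readable label for authors and is not itself translated -->
```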

Suggestions

Final thoughts

The end goals here are to maximise efficiency of content localisation and address accessibility concerns for multinational users. Especially since the advent of the World Wide Web, this is no longer something that can be treated as an afterthought. In today's increasingly global environment, localisability should be just as much 'the way we do things' as interoperability, scalability and portability. Ignoring these ideas can add significant cost and delay, and reduce the final quality of multilingual products.

This paper merely attempts to provide an illustration of ways in which localisation should be considered as part of schema design. It is far from an exhaustive survey of all the points that must be considered, and indeed we are still in an exploratory phase at the moment.

Also, although inclusion of localisation-related markup in vocabularies will always be a step in the right direction, the maximum benefit will only be realised if the localisation community is involved in standardising the approach to identifying non-translatable content, providing designer's notes, and so forth. This standardisation will allow computerised translation tools to automatically recognise the vocabulary of the XML data being localised, making the exchange and processing of data at the point of localisation easier and more efficient.

The full internationalisation of DTD design will also include consideration of such things as white space handling, use of markup vs. Unicode control characters, use of alternative content or entities for different markets, provision of metadata to describe document structure for localisation tools, provision of information about available space and other aspects of content affected by localisation, the ability to tag terminology and semantics within content, a way of expanding the language tag concept to adequately cover the locale and script oriented needs of the localisation community, incorporation of markup to support international script features (such as ruby and Arabic directionality), and so on.

The ideal, in my mind, would be to establish re-usable standards, namespaces, guidelines, practices and the like, so that it becomes second nature to design with the global user in mind, and so that the wheel does not need to be reinvented each time. To make this work, designers and developers will need to assume responsibility for creating global documents, and not just leave it to the localisation community.

References

  1. Internet Assigned Numbers Authority (IANA), Official Names for Character Sets. Available from: http://www.iana.org/assignments/character-sets

  2. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0 (Second Edition) W3C Recommendation. Section 2.12 Language Identification. (See also the erratum at http://www.w3.org/XML/xml-V10-2e-errata#E11). Available from: http://www.w3.org/TR/REC-xml#sec-lang-tag

  3. H. Alvestrand, Tags for the Identification of Languages, IETF RFC 3066. Available from: http://www.ietf.org/rfc/rfc3066.txt

  4. W3C Internationalisation Group, Language tagging in HTML and XML. Available from: http://www.w3.org/International/O-HTML-tags.html

  5. Yves Savourel, XML Internationalization and Localization. Available from: Sams. ISBN 0672320967.

  6. Richard Ishida, Yves Savourel, ITS Requirements. Available from: http://groups.yahoo.com/group/lisa-its/files/ITS-Requirements/ITS-Requirements.html

For more information, contact GoGlobal@gbr.xerox.com