[Cache from http://www.xerox-emea.com/globaldesign/paper/dtds/dtd-paper.htm; please use this canonical URL/source if possible.]
If you are creating XML (or SGML) documents that will be translated, there are things you should have built into your Document Type Definition (DTD) to enable localisation to go smoothly and efficiently. This paper looks at some of the key issues.
The paper refers to standard topics such as character encoding and language declarations, but also covers topics such as implementation of emphasis and style conventions, handling of citations, use of text in attribute values, and the need for an element like HTML's SPAN. In addition, other topics that have traditionally been associated with translation of user interface messages become applicable due to the nature of XML documents. These include the provision of designer's notes, identification of non-translatable text, and use of element ids for automatic translation of elements.
The paper assumes some familiarity with XML and DTD concepts, but remains at the conceptual level, rather than proposing specific DTD constructs.
The DTD (Document Type Definition) describes the structure of an XML document, and our focus here will be on issues that someone should bear in mind with regard to the translatability of the document structure they are defining. We will not focus on issues relating to the implementation of content or presentation in this paper.
The paper is not intended to be a definitive or authoritative guide. The intention is to raise some issues and propose possible solutions in an attempt to widen the debate.
The ideas arose from attempts at Xerox to define a DTD to support document creation, document re-purposing and content re-use. Given the space available, thoughts presented here aim to introduce the reader to the key concepts rather than to analyse potential solutions in great depth.
The information is presented as a series of problem statements, followed by suggestions for DTD implementation. The word 'suggestion' has been chosen to convey the spirit of the paper - a foundation for further discussion.
Although the activity at Xerox dealt with SGML documents, this paper will focus specifically on XML applications. Note, however, that some of the suggestions have implications for process and stylesheet implementation.
The comments in this paper should not be taken as Xerox policy or practice. They are the thoughts of the writer.
There are a number of reasons why it may be necessary to leave particular words, phrases or parts of a document in English during translation.
Example 1: The text in the translated documentation refers to a user interface or part of a user interface that will not be translated (eg. 'Click on the START button.' where START refers to a command on an untranslated user interface).
Example 2: The text provides an example of command syntax the user can type, such as 'DateQuery( Year, Month )'. In this case the translator must be made aware that 'DateQuery' is command syntax and must remain in English, whereas the words 'Year' and 'Month' should be translated since these are prompts for text that the user will enter in their own language.
If text must not be translated for one reason or another it must be possible to indicate this in the XML. The mechanism for this should be defined in the DTD.
Without this, translators may at best waste time deciding what should and should not be translated and at worst will make mistakes.
Declare a common attribute with a name such as translate in the DTD such that it can be applied to any element. The values should be something like yes and no, the default being yes. The default assignment means that the author need only use the attribute to indicate an element that does not need translation.
Institute a process check to ensure that the attribute has been correctly used before handoff for translation.
Ensure that, wherever possible, translatable text is captured in elements rather than attributes, since you cannot apply non-translate flags or other meta-data to attribute text (this is discussed in more detail in the section "Don't use translatable text in attribute values").
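As a minimal sketch of the first suggestion, the declarations below attach a translate attribute with a default of yes to two elements. The element names para and command are hypothetical and are not taken from any Xerox DTD; a real DTD would more likely share the declaration through a parameter entity in the external subset.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!-- Hypothetical elements; only the translate attribute follows the text -->
  <!ELEMENT para    (#PCDATA | command)*>
  <!ELEMENT command (#PCDATA)>
  <!-- Default "yes": authors only mark what must NOT be translated -->
  <!ATTLIST para    translate (yes | no) "yes">
  <!ATTLIST command translate (yes | no) "yes">
]>
<para>Type <command translate="no">DateQuery</command>( Year, Month ) to run the query.</para>
```

A validating parser supplies the default value for the para element here, so translation tools can treat the attribute as always present.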
It may be necessary to apply a translate or other attribute (eg. a language indicator) to a range of text which is not bounded by tags. Provide a span element like that provided in HTML 4.0, which has no special presentational or structural properties but which can carry the necessary attributes. For example, the element could be used as follows if the hard panel is not translated:
On the hard panel the text
<span language="en" translate="no">Document Feeder</span> indicates
the place to insert your
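The span element could be declared along the following lines. This is only a sketch: para is a hypothetical parent element, and the reserved xml:lang attribute is used in place of the language attribute shown in the example above, which is an assumption rather than something the example prescribes.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA | span)*>
  <!-- span carries attributes only; it implies no structure or presentation -->
  <!ELEMENT span (#PCDATA | span)*>
  <!ATTLIST span
    xml:lang  NMTOKEN    #IMPLIED
    translate (yes | no) "yes">
]>
<para>The hard panel is labelled <span xml:lang="en" translate="no">Document Feeder</span>.</para>
```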
To assist the translator to achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to:
give the translator hints about how to translate part of the content
explain why text is not translated, or describe the use of conditional text
expand on the meaning or contextual usage of a particular element, such as what a variable refers to
indicate why a piece of text is emphasised (important, sarcastic, etc.)
clarify ambiguity sufficiently to allow correct translation
This can help translators avoid mistakes or provide needed information quickly at the point of need.
Two types of developer’s note are needed:
An 'alert' contains information that the translator must read before translating a piece of text. The translation environment should bring this type of note to the attention of the translator before they begin to translate. (For example, an instruction to the translator to leave parts of the text in the original language if a translate attribute is not available.)
A 'description' provides useful background information that the translator will refer to only if they wish. (For example, a clarification of ambiguity in the original text).
Provide two common attributes, alert and description. These attributes can be attached to any element for which a developer’s note is appropriate, and the note goes in the attribute value. (Such notes may also be appropriate for empty elements and elements pointing to things such as graphics.)
<para description="Empty is a verb, not an
If the note applies to a single word or sentence that is not already bounded by tags, the span element described earlier can be used.
Check that your localisation group is expecting and can handle these in their translation tools.
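The two note attributes might be declared as in the sketch below. The attribute names alert and description follow the suggestion above; the section, para and span elements and the note texts are hypothetical and serve only to make the fragment complete.

```xml
<?xml version="1.0"?>
<!DOCTYPE section [
  <!ELEMENT section (para+)>
  <!ELEMENT para    (#PCDATA | span)*>
  <!ELEMENT span    (#PCDATA)>
  <!-- Both notes are optional free text, so CDATA #IMPLIED is enough -->
  <!ATTLIST para
    alert       CDATA #IMPLIED
    description CDATA #IMPLIED>
  <!ATTLIST span
    alert       CDATA #IMPLIED
    description CDATA #IMPLIED>
]>
<section>
  <para alert="Leave the product name in English.">Start the <span description="Product name, also printed on the device.">Auditron</span> utility.</para>
</section>
```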
It was felt that this need not, and should not, be an element for reasons including the following:
this text will not need any further qualification through attributes other than to specify whether it is an alert or concept description, and will not have any children
the text does not need to be presented for translation, and it will not be necessary to label the text as non-translatable in each case
an element name is less easily handled by translation tools than an attribute name
the element would have to be a child of almost any other element, adding to the complexity of the DTD and the choices of the author
this is very much meta information rather than content.
The only likely issue that can be foreseen with this approach is that the developer’s note may be quite long. This was however discounted as an issue, since such long notes are unlikely to be appropriate in any case.
If translatable text is provided as an attribute value rather than element content, the following problems may arise:
It is impossible to apply meta-information such as unique ids, non-translatable flags, designer's notes, etc.
It is difficult to identify which attribute text is to be translated and which is not in a generic fashion (ie. without recalibrating the translation tools to recognise a set list for each DTD).
The following example is based on an element that associates terms for use in the index with a location in a document. If translatability issues are not borne in mind, you may end up with something such as:
<index-entry id="123" term="iroha: poem for" />
Here 'iroha' and 'poem for' are translatable text indicating an index entry and a sub-entry, but because they are attribute values they could not have non-translatable flags or designer’s notes attached if such were appropriate.
This also applies to any elements that represent text equivalent to title attributes in HTML 4.0.
Ensure that all translatable text is captured in elements.
For example, implement the index-entry element shown in the example above as follows:
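One possible element-based form is sketched below. The child element names term and subterm, the id value and the description note are all hypothetical illustrations, not the paper's own markup.

```xml
<!-- Hypothetical child elements; only index-entry and its two texts come from the example -->
<index-entry id="ix123">
  <term description="Illustrative note: 'iroha' is the name of a Japanese poem.">iroha</term>
  <subterm>poem for</subterm>
</index-entry>
```

Because each piece of translatable text now sits in its own element, it can carry a translate flag, an id or a note in the same way as any other content.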
This drives towards a WYSIWYN (What You See Is What You Need) rather than a WYSIWYG (What You See Is What You Get) authoring environment - or one that allows for the conditional hiding or visibility of certain elements, since not all elements are part of the visible document.
Emphasis and style formatting will be applied in different ways in different cultures and writing systems. For example:
Some cultures use totally different methods of emphasising text from those commonly used in the West (eg. 'amikake' and 'wakiten' in Japanese), or they may express emphasis using language rather than presentation.
For electronic documents, Japanese authors may prefer not to use bolding or italicisation in small font sizes due to the complexity of the characters.
Applying capitalisation as a way of indicating procedure names or on-screen text will fail in most non-Latin scripts, since these scripts have no upper- vs. lower-case distinction. (Also, there may not be an equivalent to proportional vs. mono-spaced fonts, as used in this document for tag names and examples.)
The stylesheet should provide a method of applying different presentational mechanisms to a range of text on a language by language basis. This however needs to be facilitated by the DTD design.
It is dangerous to invent elements or attributes in the DTD which are associated with presentational, rather than logical, properties. Allowing the author to choose tags such as bold or italicise for emphasis and style elements can lead to:
inconsistency in the way emphasis is applied in the document by the author
an inability to distinguish between applications of the tag which may need to be formatted differently in another language.
As an example of the second point, take a hypothetical document written in Japanese for on-screen viewing. Because bold and italic don't work well on screen for the complex Japanese characters at small point sizes, the Japanese DTD designer expects all highlighting to be rendered as underlined text. Therefore, the DTD provides only one element for highlighting both new terms and emphasis. When that document is translated into English (using a set of characters that do allow for reasonable rendering of bold and italic text), it may be desirable to render the new terms as bold text but the emphasis as italic text. However, because the Japanese designer only implemented a single element, this is no longer achievable without changing the DTD and manually editing the XML. Because the DTD designer was thinking in terms of presentation, the possibilities were restricted where more choices or preferences actually existed.
Provide distinct elements for style conventions and emphasis. Style conventions relate to such things as special formatting for section titles, key cap representations, new terms, example code or on-screen text, etc. Emphasis is a linguistic function.
Rather than allowing the author to select the presentational aspects of emphasis when creating the document, require them to indicate the type or intention of the emphasis. The stylesheet should produce the appropriate presentation for the current language.
The previous suggestion also applies where finer distinctions are rendered by attribute settings. For example <emphasis style="heavy-stress"> would be better than <emphasis style="bold">. The 'type' attributes of these elements should indicate the use of the tag. For example, the emphasis element may include attributes such as importance, loudness, irony, distinctiveness, etc. The style convention element could include such things as section title, keycap, screen text, code, new term, etc.
Use conditional clauses in the stylesheet to adapt the presentation style applied according to the language of the text (set by the language attribute of the current element or its ancestor) or use alternative stylesheets.
This approach will have the additional benefit of providing more discipline and consistency in the way authors use and apply emphasis and style formatting.
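A sketch of how the distinction between emphasis and style conventions might look in a DTD is given below. The element name convention and the exact attribute values are hypothetical, loosely based on the categories listed above.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para       (#PCDATA | emphasis | convention)*>
  <!ELEMENT emphasis   (#PCDATA)>
  <!ELEMENT convention (#PCDATA)>
  <!-- type records the author's intention, never the rendering -->
  <!ATTLIST emphasis
    type (importance | loudness | irony | distinctiveness) #REQUIRED>
  <!ATTLIST convention
    type (section-title | keycap | screen-text | code | new-term) #REQUIRED>
]>
<para>Press <convention type="keycap">Enter</convention> <emphasis type="importance">before</emphasis> removing the paper.</para>
```

Making type a required, enumerated attribute is one way of obliging the author to state the intention each time; whether that is too strict for a given authoring group is a judgment call.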
Note that the success of this approach relies on the ability to state the language of a particular piece of text. This will be addressed in the section entitled "Use language declarations at the top of the document and for any element or range of text in a language other than the document language".
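The conditional approach could be sketched in XSLT 1.0 as follows. This assumes that the language is declared with xml:lang, so that the XPath lang() function can be used, and that the output is HTML; the emphasis element with its type attribute and the wakiten class name are hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Same logical element, different rendering per language -->
  <xsl:template match="emphasis[@type='importance']">
    <xsl:choose>
      <!-- lang() tests the xml:lang in scope for the context node -->
      <xsl:when test="lang('ja')">
        <span class="wakiten"><xsl:apply-templates/></span>
      </xsl:when>
      <xsl:otherwise>
        <em><xsl:apply-templates/></em>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```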
The recommendations for this section touch on process issues as well as DTD design.
Certain element tags, such as emphasis, will need to be manipulated by the translator. Because the syntax of languages varies, emphasis-related highlighting will stretch across a different range of characters from language to language, and may need to be removed or doubled up by the translator in some cases. For example, verbs typically appear in the middle of English sentences, at the end of Japanese sentences, and at the beginning of Arabic sentences.
It is expected (but not necessarily always true) that many such elements which contain content will be in-line in nature rather than block elements, since the key driver for their adaptation will be to achieve a natural sentence flow in the foreign language. We will refer to such elements as phrase elements. Further, many such tags are likely to be hooks for the application of formatting style. Examples may include
Empty tags also fall into this category. In order to position an empty tag appropriately in the text the translator will usually need to know what it refers to.
Much of the time the translator will simply need to know the meaning of the tags so that they can fit the translation around the protected tag names in the appropriate way. In many cases, this will also mean re-ordering multiple tags.
There will, however, be occasions when the translator wishes to remove or insert new tags because of the requirements of their language. For example, a particular type of emphasis may be expressed by the language rather than by formatting changes and the tag can be removed. Alternatively, some languages may require two ranges of words to be emphasised in the translation where there is only one range in the original.
Use easily recognisable and descriptive names for phrase element tags, to help the translator understand their meaning and how they are used.
Ensure that translation tools allow the translator to move, add or delete phrase element tags, while protecting the others.
Consider defining all phrase element tags in a separate entity from the rest of the DTD, so that the translation tools can easily identify them if there is no ambiguity about their use.
Make a list of all such tags available to the localisation group before translation, with their meanings.
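Keeping the phrase elements in their own entity might look like the sketch below, which shows two files in one listing. The file name phrase.ent, the entity names and the choice of elements are all hypothetical, and the parameter-entity reference inside the para declaration assumes that the main DTD is an external subset.

```xml
<!-- phrase.ent: hypothetical external entity holding only the phrase elements -->
<!ENTITY % phrase.elements "emphasis | quote | span">
<!ELEMENT emphasis (#PCDATA)>
<!ELEMENT quote    (#PCDATA)>
<!ELEMENT span     (#PCDATA)>

<!-- main DTD (external subset): pulls the entity in, then uses the list -->
<!ENTITY % phrase SYSTEM "phrase.ent">
%phrase;
<!ELEMENT para (#PCDATA | %phrase.elements;)*>
```

A translation tool could then read phrase.ent alone to learn which tags the translator may move, add or delete.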
In order to most effectively re-use translated text where elements are re-used (either across update versions or across deliverables) it is necessary to have a totally unique and eternally persistent id associated with an item of content. This id allows the translation tool to correctly associate original and translated text units with each other prior to examination for changes. This can be referred to as 'change analysis'.
The potential for re-use of translations is very appealing in terms of productivity and cost savings for product launch.
Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching (commonly referred to as 'translation memory') techniques, which simply look for similar text in the database without being able to tell whether the context of its use is the same.
This change analysis technique has been possible with UI messages in the past, but the introduction of element-based XML documents will allow for its use in documents also. Note however that where text entities will be re-used across products, these ids must be totally unique.
Every element containing translatable text that is likely to be re-used as a document chunk or entity should have a required id.
Certain elements within document chunks and entities should also carry ids to facilitate the comparison of old vs. updated versions. This may apply to structural elements such as paragraphs, but also to certain phrase elements such as those delimiting quotations.
The id for an element should be totally unique at the time of creation and remain the same across all instances of that element in a given document or entity (including translated versions). Updating the text in the element should not change the id.
The id should allow any language version to be correlated with the original source or with any other language version of the text.
If your authoring tool allows it, automatically insert the id at the time of creation rather than ask the author to create something in a CDATA attribute.
If you have a structured authoring environment that automatically provides ids for elements, make sure that those ids are not only unique within a given document. Otherwise you will have problems when you chunk up text for content re-use.
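A required id could be declared as in the sketch below. Note that the XML ID attribute type only guarantees uniqueness within a single document, and that its values must be XML names (so they cannot begin with a digit); uniqueness across documents has to come from the scheme that generates the values. The prefix shown is a made-up illustration of such a scheme, and chunk and para are hypothetical elements.

```xml
<?xml version="1.0"?>
<!DOCTYPE chunk [
  <!ELEMENT chunk (para+)>
  <!ELEMENT para  (#PCDATA)>
  <!-- #REQUIRED makes a validating parser reject elements without an id -->
  <!ATTLIST chunk id ID #REQUIRED>
  <!ATTLIST para  id ID #REQUIRED>
]>
<chunk id="x7f3a9c2e-chunk">
  <para id="x7f3a9c2e-p1">Load the originals face up.</para>
  <para id="x7f3a9c2e-p2">Press Start.</para>
</chunk>
```

A translated version of this chunk would keep exactly the same id values, which is what allows the source and target paragraphs to be paired during change analysis.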
The encoding of the text in an entity must be declared or predictable in order for the application reading it to understand the contents of that entity – whether this is a document entity or an external parsed entity.
An encoding used to create the original text might not support the characters required for translation. If the translated text is in another encoding than the original, it is especially important to declare which encodings are being used where. In addition, if different encodings are being used for external entities, some conversion will be necessary to produce a single encoding throughout the document instance.
If the application cannot automatically apply the correct character set mappings the document will not be readable.
It would be useful to always declare the encoding used at the top of any file, in ASCII, to aid in reading or processing files.
XML allows for an encoding declaration at the very beginning of any document entity or external entity using the following text declaration:
<?xml version="1.0" encoding="encoding_name" ?>
The attribute value represented here by 'encoding_name' should be a name from the IANA 'charset' registry (see http://www.iana.org/assignments/character-sets).
All XML processors must recognise the UTF-8 and UTF-16 encodings of the Unicode character set, and most will also handle ASCII (7-bit) and ISO 8859-1 encodings. Entities encoded using UTF-16 must begin with a byte order mark (#xFEFF) (see Appendix B of the Unicode Standard). The processor or application must be able to deal with this byte order mark. Entities which do not use a UTF-8 or UTF-16 encoding are required to include the text declaration with an encoding declaration. This text declaration must be in an encoding equivalent to ASCII or UTF-16, so that the processor can read it.
In order to simplify the localisation process and ensure compatibility of entities for content management, store all entities with one specific type of Unicode encoding (eg. UTF-8 or UTF-16). If the application rendering or producing a document needs to use another encoding, implement a conversion process.
If storing text as UTF-16, insert the byte-order mark at the beginning of every file.
Add the encoding declaration to every entity (document or external) when storing as XML. If the encoding is UTF-8 or UTF-16, the XML specification does not require this, but it can still be useful to use the encoding declaration for file management activities (human or automated).
Note that, since the XML standard prohibits a text declaration appearing anywhere other than at the beginning of an external entity, the declaration must be removed before joining together two entities.
Use the IANA registered character set names for the values of the encoding attribute. Use the values listed in the XML standard for the Unicode, ISO and Japanese encodings mentioned there.
If not using Unicode throughout, choose commonly used encodings for the text. For example, there are many encodings for Cyrillic in Russia; choose the encoding that provides the greatest opportunity for interoperability.
Enforce use of the declaration through validity checkers and the check-in mechanisms of content management repositories.
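The sketch below shows where the declarations sit when a document entity and an external parsed entity use different encodings. It illustrates the declarations only, not the recommended storage policy of a single Unicode encoding; the element names and the file name chapter1.xml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE manual [
  <!ELEMENT manual  (chapter+)>
  <!ELEMENT chapter (#PCDATA)>
  <!-- External parsed entity, stored in a different encoding -->
  <!ENTITY chapter1 SYSTEM "chapter1.xml">
]>
<manual>&chapter1;</manual>
<!-- chapter1.xml is a separate file whose first line would be the text
     declaration: <?xml version="1.0" encoding="ISO-8859-1"?> -->
```

The processor reads each entity according to its own declaration, so the two files do not need to share an encoding until they are physically merged.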
A number of stylesheet settings will vary according to the locale of the text – ie. the language and market region. Examples include text expansion, hyphenation, wrapping rules, colour usage, fonts, spell checking, line height and inter-line spacing, quotation marks and other punctuation, etc. For the appropriate presentation to be applied automatically to documents in different languages it is therefore essential to know the language of the text.
It would also be useful to indicate the locale of the document as a whole to facilitate both processing and identification of translated documents during localisation and content management.
It should also be possible to indicate the language of the XML content for any element or range of text where the language differs from that of the document as a whole. Note that this includes graphics, audio and other unparsed entities, which may need labelling for or treatment specific to a given locale.
XML applies this principle through the use of the reserved attribute xml:lang. (Note that this attribute must be declared in the DTD if it is used in a document that is to be valid.)
The XML standard describes the permissible values of xml:lang, which start with a language identifier followed by zero or more subcodes separated by hyphens and indicating regions or dialects.
Note that calling this a 'language' attribute can sometimes be a slight misnomer. In a localisation scenario this attribute value may be used to relate information to a particular market region or 'locale' (for example Canadian French fr-CA vs. fr-FR language variants, or the data format appropriate for a country with more than one language such as Switzerland).
The attribute value should be set to a string of the form described by the XML standard, that is, a language tag as defined by RFC 3066.
There should always be at least one language declaration at the top of the document or entity that defines the language for the document or entity as a whole.
Any blocks or in-line runs of text that introduce another language should have additional language declarations applied through the use of attributes. This may also involve use of the xml:lang attribute. The value of the attribute should conform to the values used in the XML standard - ie. a sequence of tokens separated by hyphens that narrow the value down to a specific language or locale (eg. en-GB indicates UK English rather than US English).
When a document or chunk is translated, the translation process must ensure that the attribute value for the highest element in the entity is changed to reflect the new locale of the document. Translators are less likely to need to change the values of locale attributes embedded in the text, although this should be possible if appropriate.
If at all possible, enforce the use of the language attribute at the highest level of any document or external parsed entity in validity checking applications or check-in mechanisms when storing the file in a content management repository.
Consider, in addition, the appropriateness of establishing a list of RFC 3066-based labels that your authors will need. By adding these values as an enumerated list of attribute values in the DTD you can easily ensure that the authors enter valid values (and, in tools that provide such values as a pull-down list, help them add these labels more easily). Of course, a procedure should be documented to allow for the list to be added to over time.
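An enumerated list of language labels might be declared as in this sketch. The particular labels, the element names and the sample sentence are illustrative assumptions; the list would be whatever your own markets require.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document [
  <!ELEMENT document (para+)>
  <!ELEMENT para     (#PCDATA | span)*>
  <!ELEMENT span     (#PCDATA)>
  <!-- Required at the top level, optional on anything that switches language -->
  <!ATTLIST document xml:lang (en-GB | en-US | fr-CA | fr-FR | de-CH) #REQUIRED>
  <!ATTLIST para     xml:lang (en-GB | en-US | fr-CA | fr-FR | de-CH) #IMPLIED>
  <!ATTLIST span     xml:lang (en-GB | en-US | fr-CA | fr-FR | de-CH) #IMPLIED>
]>
<document xml:lang="en-GB">
  <para>The French label reads <span xml:lang="fr-CA">Chargeur de documents</span>.</para>
</document>
```

A validating parser will then reject both a missing top-level declaration and any label that is not on the agreed list.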
It may be necessary to indicate that a particular document is written in more than one language. For example, a tri-fold or a quick-start manual may contain parallel text written in French, German and Italian for the Swiss market. In this case it may be necessary to enable the language declaration for the document as a whole to refer to a number of language or locale identifiers at the same time, since there is no one overall language associated with this document.
Note that the xml:lang attribute itself does not allow for multiple values to be specifically named in a single declaration.
Declare a high level element in the DTD which can have as children any number of document-level elements. A possible title for this element would be multilingual-document. The advantage of this approach is that the top level element indicates clearly at the beginning of the file all the languages or locales for which the document is intended.
Allow the element to take an attribute which allows for a number of locales to be specified by a list of values - each with the syntax described above.
Enforce the use of language attributes to indicate where each main language part begins and ends.
Thus a Swiss document may have a structure along the following lines:
Chapters in French ...
<document-element xml:lang="de-CH">
Chapters in German ...
<document-element xml:lang="it-CH">
Chapters in Italian ...
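A complete, valid instance of this first approach might look like the sketch below. The attribute name locales and its NMTOKENS type are assumptions made for illustration; the paper does not name the attribute.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE multilingual-document [
  <!ELEMENT multilingual-document (document-element+)>
  <!-- NMTOKENS allows a space-separated list of locale labels -->
  <!ATTLIST multilingual-document locales NMTOKENS #REQUIRED>
  <!ELEMENT document-element (#PCDATA)>
  <!ATTLIST document-element xml:lang NMTOKEN #REQUIRED>
]>
<multilingual-document locales="fr-CH de-CH it-CH">
  <document-element xml:lang="fr-CH">Chapters in French ...</document-element>
  <document-element xml:lang="de-CH">Chapters in German ...</document-element>
  <document-element xml:lang="it-CH">Chapters in Italian ...</document-element>
</multilingual-document>
```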
Rather than create your own element, use the xml:lang attribute at the document level with the value set to mul (for 'multiple'). Then use xml:lang for each document part as normal to declare the language of that part. The advantage of this approach is that it may allow for better standardisation of the approach to dealing with multiple-locale documents by reducing the likelihood of variability in the name of the attribute used. The disadvantage is that you need to search through the document to find out what locales are represented in it.
Thus our Swiss document above may have the following structure:
Chapters in French ...
Chapters in German ...
Chapters in Italian ...
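The second approach could be sketched as follows. The element names document and document-element are hypothetical; only the use of xml:lang with the value mul at the top level comes from the suggestion itself.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document [
  <!ELEMENT document         (document-element+)>
  <!ELEMENT document-element (#PCDATA)>
  <!ATTLIST document         xml:lang NMTOKEN #REQUIRED>
  <!ATTLIST document-element xml:lang NMTOKEN #REQUIRED>
]>
<!-- "mul" says only that several languages follow, not which ones -->
<document xml:lang="mul">
  <document-element xml:lang="fr-CH">Chapters in French ...</document-element>
  <document-element xml:lang="de-CH">Chapters in German ...</document-element>
  <document-element xml:lang="it-CH">Chapters in Italian ...</document-element>
</document>
```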
Take an example such as the following:
Selecting Initialise Auditron will always produce a confirmation screen. If OK is selected twice, the Auditron will be initialised and the account data deleted.
The text 'Initialise Auditron' is a quotation from the user interface. Since these messages have typically already been translated, the translator's job is to locate the actual translation previously supplied so as to maintain consistency. The same principle applies to quotations from whatever source. If they already exist in translation, the translator's task is to find and re-use the original phrasing from the previous translation activity.
For example, 'Initialise Auditron' may have been translated in Spanish in any of the following ways:
Inicializar el Auditrón
Inicialización del Auditrón
Any of these are acceptable translations, but the translator must try to choose the exact words used in the context which is being quoted.
Similarly, the text 'OK' could be translated in any of the following ways:
This time there is a different reason to choose the appropriate translation since the standard is different on Windows, Macintosh and Unix.
Note also that quotation marks are best added by the stylesheet rather than the author. The actual punctuation marks used for quotations and quotations within quotations vary considerably from language to language, even across languages based on the Latin script. Applying the quotation marks via the stylesheet guarantees consistent usage and placement of quotation marks, and allows for quotation marks to be easily localised or omitted altogether in repurposed information (eg. in a list of quotations derived from the text using XSLT).
The ideas expressed in the previous paragraph try to reinforce the idea that the author should be concerned with content, not presentation, by treating the quotation mark as presentational. Note however that there could be instances where such an approach is not valid, for example if you were quoting a quotation that itself contained quotation marks and you wanted to ensure that the text is preserved character-for-character as in the original. In such a case it would be valid for the author to enter characters rather than markup.
Capture quotations in an element. Note that this is not primarily concerned with the presentational concerns such as the addition or not of quotation marks, but rather semantically identifies a piece of text as being lifted from some other location.
Include an attribute such as src to indicate the source of the quotation. This helps the translator find the appropriate translation.
Ask the author not to add quotation marks but apply them using the stylesheet in a manner appropriate to the current language (unless you are trying to preserve the character-for-character identity of some text).
The following shows how the example at the beginning of this section could be coded:
Selecting <quote src="UI:auditron initialisation">Initialise Auditron</quote> will always produce a confirmation screen. If <quote src="UI:auditron initialisation">OK</quote> is selected twice, the Auditron will be initialised and the account data deleted.
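Language-dependent quotation marks could then be supplied by the stylesheet, as in this XSLT 1.0 sketch. It assumes the language is declared with xml:lang, and the choice of marks for each language is a common convention given for illustration rather than a rule from this paper.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="quote">
    <xsl:choose>
      <!-- French: guillemets with no-break spaces inside -->
      <xsl:when test="lang('fr')">
        <xsl:text>&#171;&#160;</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>&#160;&#187;</xsl:text>
      </xsl:when>
      <!-- German: low opening mark, high closing mark -->
      <xsl:when test="lang('de')">
        <xsl:text>&#8222;</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>&#8220;</xsl:text>
      </xsl:when>
      <!-- Fallback: English-style single quotation marks -->
      <xsl:otherwise>
        <xsl:text>&#8216;</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>&#8217;</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```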
This takes the idea in the previous section a little further. If a UI message is quoted in the documentation, it is likely to improve productivity and quality of localisation to pull the translated text directly from the UI database, rather than asking the translator to type it in again.
In this case, the example given immediately above may look something more like the following:
Selecting <ui-message name="Initialise Auditron" id="msg123" /> will always produce a confirmation screen. If <ui-message name="OK" id="msg124" /> is selected twice, the Auditron will be initialised and the account data deleted.
To implement a means of achieving this will require prior consideration of how the process will provide access to the UI message database for the author, and how the UI database will be consistently populated with appropriate data. You will need to consult with the localisation tools group to establish a method for achieving this via the inclusion of a dedicated cross-reference element.
Then add an appropriate cross-referencing facility to the DTD.
The name attribute in the example is suggested because showing the actual text helps the translator understand how the text will read. Note that this is not for the translator to translate; it is purely for information.
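The cross-reference element might be declared as in the sketch below, with para as a hypothetical parent. The id attribute is declared here as NMTOKEN rather than ID, on the assumption that it is a key into the UI message database and that the same message may be quoted more than once in a document, which the ID type would not permit.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA | ui-message)*>
  <!-- Empty element: the displayed text is pulled from the UI message database -->
  <!ELEMENT ui-message EMPTY>
  <!ATTLIST ui-message
    id   NMTOKEN #REQUIRED
    name CDATA   #IMPLIED>
]>
<para>Selecting <ui-message name="Initialise Auditron" id="msg123"/> will always produce a confirmation screen.</para>
```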
The end goals here are to maximise efficiency of content localisation and address accessibility concerns for multinational users. Especially since the advent of the World Wide Web, this is no longer something that can be treated as an afterthought. In today's increasingly global environment, localisability should be just as much 'the way we do things' as interoperability, scalability and portability. Ignoring these ideas can add significant cost and delay, and reduce the final quality of multilingual products.
This paper merely attempts to provide an illustration of ways in which localisation should be considered as part of schema design. It is far from an exhaustive survey of all the points that must be considered, and indeed we are still in an exploratory phase at the moment.
Also, although inclusion of localisation-related markup in vocabularies will always be a step in the right direction, the maximum benefit will only be realised if the localisation community is involved in standardising the approach to identifying non-translatable content, providing designer's notes, and so forth. This standardisation will allow computerised translation tools to automatically recognise the vocabulary of the XML data being localised, making the exchange and processing of data at the point of localisation easier and more efficient.
The full internationalisation of DTD design will also include consideration of such things as white space handling, use of markup vs. Unicode control characters, use of alternative content or entities for different markets, provision of meta data to describe document structure for localisation tools, provision of information about available space and other aspects of content affected by localisation, the ability to tag terminology and semantics within content, a way of expanding the language tag concept to adequately cover the locale and script oriented needs of the localisation community, incorporation of markup to support international script features (such as ruby and Arabic directionality), and so on.
The ideal, in my mind, would be to establish re-usable standards, namespaces, guidelines, practices and the like, so that it becomes second nature to design with the global user in mind, and so that the wheel does not need to be reinvented each time. To make this work, designers and developers will need to assume responsibility for creating global documents, and not just leave it to the localisation community.
Internet Assigned Numbers Authority (IANA), Official Names for Character Sets. Available from: http://www.iana.org/assignments/character-sets
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0 (Second Edition) W3C Recommendation. Section 2.12 Language Identification. (See also the erratum at http://www.w3.org/XML/xml-V10-2e-errata#E11). Available from: http://www.w3.org/TR/REC-xml#sec-lang-tag
H. Alvestrand, Tags for the Identification of Languages, IETF RFC 3066. Available from: http://www.ietf.org/rfc/rfc3066.txt
W3C Internationalisation Group, Language tagging in HTML and XML. Available from: http://www.w3.org/International/O-HTML-tags.html
Yves Savourel, XML Internationalization and Localization. Available from: Sams. ISBN 0672320967.
Richard Ishida, Yves Savourel, ITS Requirements. Available from: http://groups.yahoo.com/group/lisa-its/files/ITS-Requirements/ITS-Requirements.html