[Cache from http://groups.yahoo.com/group/lisa-its/files/ITS-Requirements/ITS-Requirements.html; please use this canonical URL/source if possible.]
When creating DTDs it is important to include constructs that meet the needs of the localisation community, and that enable documents produced using the DTD to be successfully rendered in another language and / or locale. This document sets out to list the key requirements in this regard. It will be used to provide a framework and direction for a detailed solution proposal (or set of proposals) to be developed later.
This section describes the status of this document at the time of its publication. Other documents may supersede this document.
This is the first edition of this requirements document and is intended to provide a starting point for discussion. It is in no way final. This document is a Working Draft for review by ITS mailing list members and other interested parties. It is a draft document and may be updated, replaced, or made obsolete by other documents at any time. It is inappropriate to use this document as reference material or to cite it as other than "work in progress".
This document is for public review, and comments and discussion are welcomed on the public mailing list <firstname.lastname@example.org>. To subscribe, send an email to <email@example.com>. To view archived mail and files visit <http://groups.yahoo.com/group/lisa-its>.
As XML usage grows in various domains, more and more XML documents and applications are localised into different languages and / or locales. In order to localise XML data in a cost-effective and time-efficient manner, a number of conditions must exist. Most of these conditions can be set early in the development of the XML schemas or DTDs. In addition, various guidelines should be followed during the development of the XML content. This document highlights the different requirements to create XML document types and author XML content well-suited for localisation.
Internationalisation activities should always have two aims in mind:
This document places significant emphasis on the second of these points. The localisation process and tools typically receive scant consideration in the creation of XML documents and applications. It is hoped that the localisation community and localisation tools vendors will contribute to the development of this document and its related proposals to ensure that they promote standards of value to that community.
It is hoped that these proposals will provide useful context and support information to localisers, in a standard way, and will promote the use of standard mechanisms for encapsulating information required by translation tools and processes. Such standardisations should result in faster, less labour-intensive, and more powerful localisation processes and tools, which in turn benefit the producers of content by lowering barriers to international deployment of their applications.
The target audience of this document includes the following categories:
This information is also likely to be relevant to developers of new
internet technologies at the World Wide Web Consortium and related
In this document we try not to propose solutions (although it may be hard to avoid some of the more obvious possibilities) - the expectation is that the ITS group will produce a response to this document that proposes best practices or standard tags, etc. The role of this document is to provide some direction and list the issues needing resolution.
As it is unlikely that we will have identified the complete problem set for some time, the intention is for the document to be updated on an ongoing basis as new ideas and issues arise or are proposed. The updating of this document may therefore occur simultaneously with the development of the response in the early stages. For this period the versioning will be indicated by the date at the beginning of this document. At some point, when it is felt that we have a workable body of issues to address in an initial solution proposal, the numbering and version control will be phased to coincide with the solution proposal documents.
The requirements for localisation-specific information in XML can be defined at several levels.
General requirements may include the following:
It must be possible to signal to the localisation group that a
particular item of content should not be changed during localisation. This may
refer to a single character or a large chunk of data; it may refer to text
data, structural items, or graphic or multimedia entities. The method used
should allow localisation tools to automatically identify and isolate the
specific data in question.
It must be possible to apply this information to any element or range
of text and have that information depend on any other element, attribute, or
There are a number of reasons why it may be necessary to leave particular words, phrases or parts of a document in English during translation.
If text must not be translated for one reason or another it must be possible to indicate this in the XML. The mechanism for this should be defined in the DTD.
Without this, translators may at best waste time deciding what should and should not be translated, and at worst will make mistakes.
No-translate assignments can be particularly useful where machine translation or gisting is involved, since the computer typically has no way of deciding what it would be inappropriate to translate.
Could result in an attribute or tag name that is promoted as a
standard by the localisation industry.
A response to this requirement should consider whether it is appropriate to stipulate that the non-translate / translate setting is inherited - allowing a structural element to apply the property to all contents with minimum intervention from the author.
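The inheritance question raised above can be illustrated with a minimal sketch. It assumes a hypothetical translate attribute taking the values "yes" and "no", and hypothetical element names; none of these are proposed as standard here. Only element text is examined (tail text is ignored) to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: attribute and element names are assumptions.
SAMPLE = """<doc>
  <section translate="no">
    <para>PART-0042</para>
    <para translate="yes">Handle with care.</para>
  </section>
  <para>Welcome to the manual.</para>
</doc>"""


def translatable_text(root, default=True):
    """Return (text, translatable) pairs, inheriting the flag from ancestors."""
    results = []

    def walk(elem, inherited):
        flag = elem.get("translate")
        # An explicit value overrides the inherited one; otherwise inherit.
        current = inherited if flag is None else (flag == "yes")
        text = (elem.text or "").strip()
        if text:
            results.append((text, current))
        for child in elem:
            walk(child, current)

    walk(root, default)
    return results


pairs = translatable_text(ET.fromstring(SAMPLE))
for text, flag in pairs:
    print("translate" if flag else "protect", "-", text)
```

With inheritance, the author sets the flag once on the section and overrides it only where needed, which is the "minimum intervention" the paragraph above has in mind.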
Is there a requirement for a 'localise' attribute that indicates that localisation changes should be made that don't entail translation - e.g. changing a contact address? This may be helpful to show what must be addressed in post-editing after machine translation has taken place.
An approach should be defined to signal to the localisation group that a particular item of content should not be changed during localisation where this is a function of contextual rules. This may refer to a single character or a large chunk of data; it may refer to text data, structural items, or graphic or multimedia entities. The method used should allow localisation tools to automatically identify and isolate the specific data in question.
The following example based on UIML is taken from [XMLI&L]:
<property part-name="Main" name="content">Sample
name="content">Some text to
In this example only the highlighted text in the 3rd and 5th lines
should be translated. The clue to the appropriateness of translation is given
in this case by the value of the
In other schemas it is possible that the content of a particular element should never be translated, due to the nature of the data it contains (e.g. a part number).
An element must be provided that behaves like the SPAN element in
HTML 4.0. This will allow ITS requirements (e.g. translatability or language
information) to be ascribed to a range of text that is not bounded by elements
Could result in an attribute or tag name that is promoted as a standard by the localisation industry.
A method must exist for authors to communicate information to localisers about a particular item of content. There should be two such types of information: firstly, notes that must be read before the localiser attempts to localise, and secondly, notes that provide optional background information. Localisation tools must be able to automatically identify and isolate the specific data to which the note refers, and automatically distinguish between the two different types of note.
To assist the translator to achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to:
This can help translators avoid mistakes or avoid spending time searching for information.
Two types of developer's note are needed:
An alert contains information that the translator MUST read before translating a piece of text. The translation environment must bring this type of note to the attention of the translator before they begin to translate. (For example, an instruction to the translator to leave parts of the text in the source language.)
A description provides useful background information that the translator will refer to only if they wish. (For example, a clarification of ambiguity in the source text). The translation tool would still make this available to the translator, but would not force them to read it before attempting a translation. The translator may only receive an indication that such a note exists and have to take action to view the text.
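As a rough illustration of how a tool might distinguish the two types of note automatically, the following sketch assumes a hypothetical note element with a type attribute whose values are "alert" and "description". These names are invented for the example and are not proposed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the note element and its type values are assumptions.
SAMPLE = """<message id="m1">
  <note type="alert">Leave the product name in the source language.</note>
  <note type="description">Shown when the paper tray is empty.</note>
  <text>Load paper into the tray.</text>
</message>"""


def split_notes(message):
    """Separate notes that must be read from optional background notes."""
    alerts = []
    descriptions = []
    for note in message.findall("note"):
        if note.get("type") == "alert":
            alerts.append(note.text)
        elif note.get("type") == "description":
            descriptions.append(note.text)
    return alerts, descriptions


alerts, descriptions = split_notes(ET.fromstring(SAMPLE))
# A translation environment would show alerts before editing begins and
# merely indicate that descriptions are available on request.
print("Must read:", alerts)
print("Optional:", descriptions)
```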
Could result in an attribute or tag name that is promoted as a standard by the localisation industry.
A schema should ensure that translatable text is stored in elements rather than attributes whenever possible.
If translatable text is provided as an attribute value rather than element content, the following problems may arise:
In the example code below the no-translate flag applies to the content of the element, but not to the title text. The title text may benefit from id-based leveraging, but has no ID. The xml:lang attribute, after translation, will only be relevant for the element content, not the title text.
<extract id="0517.1447" translate="no" xml:lang="en"
title="Ambiguous linguistic construct.">The man hit the boy with the stick
in the bathroom.</extract>
In the next example part of the alt text should be left untranslated (the name of the picture), but it is difficult to see how that would be expressed so that a machine translation tool would exhibit the correct behaviour.
<image id="0517.1716" alt-text="Catalog number 123: The Fish
The schema should express the application of emphasis and document conventions to a particular range of content using naming that reflects the intention and that is not tied in any way to presentation.
Formatting of emphasis and document conventions will be applied in different ways by different cultures and writing systems. For example:
Allowing the author to choose
italicise or similar tags for emphasis and style elements can lead
Instead the DTD should provide tag or attribute names such as
A stylesheet should be able to provide a method of applying different presentational mechanisms to a range of text on a language-by-language basis (ITS does not currently include stylesheet internationalisation in its remit).
It must be possible for localisation tools and localisers to clearly identify and recognise tags whose applicability and extent will need to be changed during the process of localisation. There must be no restrictions on the modification of these tags as the content changes during localisation (i.e. changes to the location and extent of the tag in relation to the content, as well as duplication and deletion, must all be allowable).
Certain element tags, such as
emphasis, will need to be
manipulated by the translator. For example, underlining will stretch across a
different range of characters, and may need to be removed or doubled up by the
translator within a given sentence.
It is expected (but not necessarily always true) that many such
elements which contain content will be in-line in nature rather than block
elements, since the key driver for their adaptation will be to achieve a
natural sentence syntax in the foreign language. Further, many such tags are
likely to be hooks for the application of formatting style. Examples may
Empty tags also fall into this category. In order to position an empty tag appropriately in the text the translator will usually need to know what it refers to.
Much of the time the translator will simply need to know the meaning of the tags so that they can fit the translation around the protected tag names in the appropriate way. In some cases, achieving a good sentence flow will mean re-ordering multiple tags.
There will also be occasions when the translator wishes to remove or insert new tags because of the requirements of their language. For example, in a particular language, emphasis may be expressed by the language itself rather than by formatting changes and in this case the translator will want to remove the emphasis tag altogether. The converse of this, of course, is that when translating in the other direction the localiser will need to add markup that was not present in the original. Alternatively, some languages may require an additional range of emphasis to be defined within the same sentence.
The recommendations for this section touch on process issues as well as DTD design.
It should be possible to attach a unique identifier to any localisable item of content - be it text, structure or unparsed entity. This id should be unique across all documents.
In order to most effectively re-use translated text where content is
re-used (either across update versions or across deliverables) it is necessary
to have a totally unique and eternally persistent id associated with the
element. This id allows the translation tool to correctly
associate source and translated text units with each other prior to
examination for changes.
This approach can be referred to
The potential for re-use of translations is very appealing in terms of productivity and cost savings for product launch.
Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching (a.k.a. translation memory) techniques, which simply look for similar source text in the database without being able to tell whether the context of its use is the same.
This change analysis technique has been possible with UI messages in the past, but the introduction of structured XML (and SGML) documents will allow for its use in documents also.
Where text entities will be re-used across products, or where a localisation group is dealing with these ids must be totally unique.
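The id-based change analysis described above can be sketched as follows. The data structures (plain dictionaries keyed by id) and the status labels are assumptions made for illustration; a real tool would read the units from the XML source and a translation database.

```python
def leverage(old_source, old_target, new_source):
    """Classify each unit of new_source by comparing it, by id, with the
    previous source text and its existing translation."""
    plan = {}
    for unit_id, text in new_source.items():
        if unit_id not in old_source:
            # Never seen before: needs a fresh translation.
            plan[unit_id] = ("new", None)
        elif old_source[unit_id] == text and unit_id in old_target:
            # Same id, same source text: the old translation can be re-used.
            plan[unit_id] = ("reuse", old_target[unit_id])
        else:
            # Same id but the source changed: offer the old translation
            # (if any) as a starting point for the translator.
            plan[unit_id] = ("changed", old_target.get(unit_id))
    return plan


# Hypothetical ids and texts, for illustration only.
old_source = {"u1": "Open the cover.", "u2": "Press Start."}
old_target = {"u1": "Ouvrez le capot.", "u2": "Appuyez sur Start."}
new_source = {"u1": "Open the cover.", "u2": "Press the Start button.", "u3": "Close the cover."}

plan = leverage(old_source, old_target, new_source)
for unit_id, (status, suggestion) in plan.items():
    print(unit_id, status, suggestion)
```

Because the comparison is keyed on the id rather than on the text, an unchanged unit is matched in its own context, which is the advantage over plain source matching noted above.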
Character encoding should be identified at the top of any parsed entity using the standard method defined for XML. The encoding should be declared for all external parsed entities that will be included in a document.
The encoding of the text in an entity must be declared or predictable in order for the application reading it to understand the contents of that entity - whether this is a document entity or an external parsed entity.
An encoding used to create source text might not support the characters required for translation. If the translated text is in a different encoding from the source, it is especially important to declare which encodings are being used where. In addition, if different encodings are being used for external entities, some conversion will be necessary to produce a single encoding throughout the document instance.
If the application cannot automatically apply the correct character set mappings the document will not be readable.
It would be useful to always declare the encoding used at the top of any file, in ASCII, to aid in reading or processing files.
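A tool can only rely on the declaration if it can read it before decoding the rest of the entity. The sketch below assumes an ASCII-compatible encoding (so that the declaration itself is readable as ASCII bytes) and ignores UTF-16 and similar cases; the function name is invented for the example.

```python
import re

# Matches the encoding pseudo-attribute of an XML or text declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data):
    """Return the encoding named in the declaration, or None if absent.

    Assumes the entity starts in an ASCII-compatible encoding.
    """
    head = data[:200]
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        head = head[3:]
    match = _DECL.match(head)
    return match.group(1).decode("ascii") if match else None


print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'))
print(declared_encoding(b'<?xml version="1.0"?><doc/>'))
```

When the function returns None, the entity is relying on a default or on external information, which is exactly the situation the requirement tries to avoid.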
The main language (or languages of a truly multilingual document) must be declared at the beginning of any document, using industry standard approaches. Such declarations should also apply to any external parsed entities that are stored separately.
Any content in another language within a document should be labelled appropriately.
In addition, it must be possible to declare a single document as being composed of multilingual parts of equal standing, i.e. the document entity does not represent a single language.
A number of rendering practices will vary according to the locale of the text - i.e. the language and market region. Examples include text expansion, hyphenation, wrapping rules, colour usage, fonts, spell checking, line height and inter-line spacing, quotation marks and other punctuation, etc. For the appropriate presentation to be applied automatically to documents in different languages it is essential to know the language of the text.
It would also be useful to indicate the locale of the document as a whole to facilitate both processing and identification of translated documents during localisation and content management.
It should also be possible to indicate the language of the XML content for any element or range of text where the language differs from that of the document as a whole. Note that this includes graphics, audio and other unparsed entities, which may need labelling for or treatment specific to a given locale.
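The combination of a document-level declaration and local overrides can be sketched using the standard xml:lang attribute. The element names in the sample are hypothetical; the attribute and its inheritance behaviour are those defined by XML itself.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical element names; only xml:lang is standard.
SAMPLE = """<doc xml:lang="en">
  <para>The greeting is <quote xml:lang="fr">Bonjour</quote>.</para>
  <para xml:lang="de-CH">Guten Tag</para>
</doc>"""


def effective_languages(root):
    """Return (tag, language) pairs, inheriting xml:lang from ancestors."""
    found = []

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        found.append((elem.tag, lang))
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return found


languages = effective_languages(ET.fromstring(SAMPLE))
for tag, lang in languages:
    print(tag, lang)
```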
It must be possible to declare more than just language at the beginning of any document or any external parsed entities that are stored separately. This information may include any combination of language, script usage, geographical area, dialect, or historical period.
Any content within a document which varies from the declaration at the head of the document should be labelled appropriately.
In addition, it must be possible to declare a single document as being composed of multicultural parts of equal standing - i.e. the document entity does not represent a single culture.
The current system of language identification does allow for an approximation to 'locales' by appending country codes to the language codes, but there are difficulties with this classification system that are already being encountered in localisation. For example: How does one distinguish in a standard way between simplified and traditional Chinese without using codes for China and Taiwan? How does one describe whether Serbian text is in the Latin or the Cyrillic script? How does one indicate that a voice track is in the language spoken in German-speaking Switzerland rather than the language written there, since one is Schwytzertuutsch and the other is very close to but not the same as 'High German'? How does one indicate that a piece of content is in 'International Spanish'? How does one indicate that this is English as spoken in the time of Chaucer?
Most importantly, how does one do this in a way that lends itself to tools and systems automatically recognising the labels used in order to apply presentation or processing?
This is an area that cries out for a solution that provides interoperability through standardisation; however, the development of locale- and script-related tag standards is a significant area of study in its own right that is outside the remit of ITS.
Proposals in this area will impact W3C specifications significantly.
Any citation used in text should be identified as such, and should be accompanied by information about the source of the citation. A standard approach should be used to identify the source so that localisation tools can automatically retrieve the information about the source.
Take an example such as the following:
Selecting Initialise Auditron will always
produce a confirmation screen. If OK is selected twice, the
Auditron will be initialised and the account data deleted.
The text 'Initialise Auditron' is a quotation from the user interface. Other types of quotation include mimics of the operating system or application messages. Since these messages have typically already been translated, the translator's job is to locate the actual translation previously supplied so as to maintain consistency.
For example, 'Initialise Auditron' may have been translated into Spanish in any of the following ways:
Any of these are acceptable translations, but the translator must try to choose the exact words used in the context which is being quoted.
This requirement calls for a standard way of delimiting quotes so that change analysis, source matching or other approaches can be used to locate the appropriate translation quickly.
Quotations of user interface messages in documentation text should be implemented in such a way that it is possible to retrieve the actual text from the UI resource database.
This takes the previous idea a little further. If a UI message is quoted in the documentation, it is likely to improve productivity and quality of localisation to pull the translated text directly from the UI database, rather than asking the translator to type it in again.
In this case, the example given immediately above may look something more like the following:
Selecting <ui-message name="Initialise Auditron" id="msg123"
/> will always produce a confirmation screen. If <ui-message name="OK"
id="msg124" /> is selected twice, the Auditron will be initialised and the
account data deleted.
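The retrieval step can be sketched as follows, using the ui-message markup from the example above wrapped in a hypothetical para element. The UI database is represented by a plain dictionary, and the Spanish strings in it are invented placeholders, not actual translations from any product.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<para>Selecting <ui-message name="Initialise Auditron" id="msg123" />'
    " will always produce a confirmation screen. If "
    '<ui-message name="OK" id="msg124" /> is selected twice, the Auditron '
    "will be initialised and the account data deleted.</para>"
)

# Hypothetical stand-in for the UI resource database (placeholder strings).
UI_DATABASE = {
    "msg123": {"es": "Inicializar Auditron"},
    "msg124": {"es": "Aceptar"},
}


def resolve_ui_messages(para, lang):
    """Return the paragraph text with each ui-message replaced by the UI
    string for lang, falling back to the source name attribute."""
    parts = [para.text or ""]
    for child in para:
        if child.tag == "ui-message":
            entry = UI_DATABASE.get(child.get("id"), {})
            parts.append(entry.get(lang, child.get("name", "")))
        parts.append(child.tail or "")
    return "".join(parts)


para = ET.fromstring(SAMPLE)
print(resolve_ui_messages(para, "es"))
```

In this sketch only the quoted UI strings are substituted; the surrounding sentence would still be translated in the usual way, but the translator no longer has to retype or look up the UI text.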
Where fixed sizes are used for containers (such as tables, table cells, frames, buffers, screens, etc.) a standard method should be used for indicating the dimensions of the container so that localisation tools can automatically recognise them.
This helps localisers ensure that content will fit as text expands in translation or if graphics need to be adapted.
that are not associated with DTDs or schemas should not use
element or attribute names that are dynamically created.
For example, the following XML excerpt, which has automatically generated element names, is not very conducive to localisation because most translation tools will be unable to deal with it efficiently. This is because the translation tools do not know what each element represents, and therefore how to deal with it.
<message001>Root path: </message001>
<message002>Display Options: </message002>
Where repertoire restrictions apply, there should be a means of indicating the range of characters that can be used in a local version of a document.
For example, a document may contain UI strings for a firmware application where the character set is limited and allows only a small subset of Unicode.
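If the permitted repertoire is available to a tool, checking a translated string against it is straightforward. The sketch below assumes the repertoire is supplied as a simple set of characters; how the repertoire would actually be declared in the document is left open here.

```python
import string


def outside_repertoire(text, allowed):
    """Return the sorted list of characters in text that are not allowed."""
    return sorted({ch for ch in text if ch not in allowed})


# Hypothetical repertoire for a small firmware display.
ALLOWED = set(string.ascii_letters + string.digits + " .:")

print(outside_repertoire("Root path: ", ALLOWED))
print(outside_repertoire("Gr\u00f6\u00dfe: ", ALLOWED))
```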
It should be possible to indicate that a given element or span of text is a term.
The capability to specify terms within the source content is of great interest for various translation and terminology-related tools. It facilitates, for example, the creation of glossaries and indexes, and allows terminology validation between source and target documents.
There should be a means of indicating whether an element is equivalent or not to a unit that will be used for automated translation processing. Some elements may contain other elements which are translation units in their own right.
Documents are organised in elements containing text, elements or both (mixed content). Identifying the type of each element is important for the translation tools because they base the segmentation of the text on these properties. It is also necessary to have provision for identifying an element inside mixed content that is a segment by itself. [RI: note that we should probably tease out the idea that segments may not be equivalent to elements - e.g. a number of sentences may be included within a single para element. I assume that we are not planning to try to segment at this level, are we? Although perhaps some automated process run on the XML file prior to localisation may do something of the sort???]
For example in XHTML,
<td> have mixed content, while
<small>, etc. are inline elements. At the same time a quote
within a paragraph may be a subflow-type element.
A property to indicate whether an element can mark the end of a word is necessary for tools to get accurate word counts.
For example in XHTML,
<br/> is a word-breaking
<big> are not.
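The effect of the word-breaking property on word counts can be shown with a small sketch. The set of word-breaking element names is an assumption chosen for the example; in practice it would come from whatever mechanism is used to declare the property.

```python
import xml.etree.ElementTree as ET

# Assumed set of word-breaking elements, for illustration only.
WORD_BREAKING = frozenset({"br", "p", "td"})


def flatten(elem, word_breaking):
    """Concatenate the text of elem, inserting a space around the content
    of word-breaking child elements and nothing around the others."""
    parts = [elem.text or ""]
    for child in elem:
        sep = " " if child.tag in word_breaking else ""
        parts.append(sep + flatten(child, word_breaking) + sep)
        parts.append(child.tail or "")
    return "".join(parts)


def word_count(xml_text, word_breaking=WORD_BREAKING):
    """Count whitespace-separated words in the flattened text."""
    return len(flatten(ET.fromstring(xml_text), word_breaking).split())


sample = "<p>un<b>believ</b>able<br/>result</p>"
print(word_count(sample))
print(word_count(sample, word_breaking=frozenset()))
```

In the sample, the b element does not split "unbelievable", while the br element separates it from "result". Without the word-breaking information the two words would be counted as one.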
It must be possible to specify whether a given element allows white space to be collapsed during translation.
The xml:space="preserve" attribute allows an author to specify that
white space must be preserved, but this property should also be available at
the document type level.
Knowing whether the white space in a given element is collapsible or not is important for proper matching when using translation memory tools.
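The impact on translation memory matching can be sketched as follows. The memory is represented by a plain dictionary and the stored translation is an invented placeholder; the point is only that collapsible text should be normalised before lookup, while preserved text should not.

```python
import re


def normalise(text, preserve=False):
    """Collapse runs of white space unless the element preserves it."""
    return text if preserve else re.sub(r"\s+", " ", text).strip()


# Hypothetical translation memory keyed on normalised source text.
MEMORY = {"Root path:": "Chemin racine :"}


def lookup(text, preserve=False):
    """Return the stored translation for text, or None if there is no match."""
    return MEMORY.get(normalise(text, preserve))


print(lookup("Root   path:\n  "))
print(lookup("Root   path:\n  ", preserve=True))
```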
Unicode formatting and control code characters should only be used when markup is not appropriate.
For example, there are Unicode control characters that allow the user to control bidirectional formatting of Arabic and Hebrew, but it is better to use markup to achieve this behaviour. There are some Unicode characters, however, that should be used for controlling format.
The guidelines for this requirement should be provided by
the document, 'Unicode in XML and other Markup Languages', ....
which is a Unicode TR and a W3C note.
Markup should be available to support the required behaviours of all scripts.
This covers behaviours that are not common to all scripts. For example, ruby support may be needed for Far Eastern documents, and bidirectional control tags are needed for Middle Eastern documents. These should be incorporated into the schema if it is to be properly internationalised.
The W3C specifications can provide the lead here.
[RI: There may also be aspects relating to the implementation of apparently standard features which bear investigation - for example, is there a recommended way of implementing lists given that numbering systems may vary widely in different areas - ie. the list type properties may need to be defined, but also a localisable approach may need to be taken to allow the application of such.]
The DTD should provide support for
stylesheets, templates and user interfaces.
Stylesheets often contain text which is presentational in nature, for example a 'Warning' title for warning text. If this text is in the stylesheet it can create difficulties for localisation. A better approach is to gather all such text into an XML file and refer to the appropriate piece of text from the stylesheet.
An externalised UI string may contain embedded elements, such as a variable that refers to a figure number. Such variables will need to be equipped with all necessary attribute values to convey information to the translator - such as what the variable stands for, whether or not it truncates and if so its minimum length, otherwise its maximum length. There must also be a unique identifier to allow the stylesheet to provide the value of the variable. [RI: need to develop this further - may lead to additional requirements or at least an amplification of the current requirement.]
Externalised strings should be accompanied by elements (designer's notes) that describe the use of the string and unique ids that the stylesheet will use to reference them. They may also be grouped.
This document was developed with contributions from the following people: