TMX Format
Version 1.3 - August-29-2001
This document describes the TMX file format. TMX stands for Translation Memory eXchange. OSCAR (Open Standards for Container/Content Allowing Re-use) is the LISA Special Interest Group responsible for its definition.
See also:
The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process.
TMX is defined in two parts:
TMX can be implemented on three levels:
TMX is XML-compliant. It also uses various ISO standards for date/time, language codes, and country codes. (See References section.)
TMX files are intended to be created automatically by export routines and processed automatically by import routines. TMX files are "well-formed" XML documents that can be processed without explicit reference to the TMX DTD. However, a "valid" TMX file must conform to the TMX DTD, and any suspicious TMX file should be verified against the TMX DTD using a validating XML parser.
Since XML syntax is case sensitive, any XML application must define casing conventions. All element and attribute names of TMX are defined in lower-case.
The TMX namespace is defined as http://www.lisa.org/tmx. For example, if you want to use TMX fragments in another XML document, your document will look something like:
<?xml version="1.0" ?>
<myformat>
 <data>
  <tmx xmlns="http://www.lisa.org/tmx">
   ... TMX data
  </tmx>
 </data>
</myformat>
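As a rough illustration of the namespace rule above, the following Python sketch locates TMX fragments embedded in a host document by their namespace URI. The host format (myformat, data) is taken from the example; the helper name find_tmx_fragments and the trimmed fragment content are assumptions made for illustration, and the fragment is not a complete TMX document.

```python
import xml.etree.ElementTree as ET

TMX_NS = "http://www.lisa.org/tmx"  # namespace URI given in the specification

# Host document with an embedded TMX fragment (trimmed for illustration,
# so the fragment is not a complete TMX document).
DOCUMENT = """<?xml version="1.0" ?>
<myformat>
 <data>
  <tmx xmlns="http://www.lisa.org/tmx" version="1.3">
   <body/>
  </tmx>
 </data>
</myformat>"""


def find_tmx_fragments(root):
    """Return every <tmx> element bound to the TMX namespace."""
    # ElementTree spells namespaced tags as "{uri}localname".
    return root.findall(f".//{{{TMX_NS}}}tmx")


fragments = find_tmx_fragments(ET.fromstring(DOCUMENT))
for fragment in fragments:
    print(fragment.tag, fragment.get("version"))
```

Matching on the namespace URI rather than on a prefix keeps the lookup independent of how the host document chooses to declare the namespace.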
TMX files are always in Unicode. They can use any of three encoding methods: UCS-2 (16-bit files), UTF-8 (8-bit files) or ISO-646 [US-ASCII] (7-bit files). In all cases only the following five character entities are allowed: &amp; (&), &lt; (<), &gt; (>), &apos; ('), and &quot; ("). For 7-bit files, extended (non-ASCII) characters are always represented by numeric character references using the Unicode hexadecimal values (e.g. &#x0394; for GREEK CAPITAL LETTER DELTA).
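The 7-bit rule can be sketched in Python as follows. This is a minimal example, not a reference implementation: the function name to_7bit_tmx_text is hypothetical, and the four-digit zero padding is a presentational choice rather than something the text requires.

```python
from xml.sax.saxutils import escape


def to_7bit_tmx_text(text):
    """Prepare character data for a 7-bit (US-ASCII) TMX file.

    Markup characters (&, <, >) become predefined entities, and every
    non-ASCII character becomes a hexadecimal numeric character reference.
    """
    escaped = escape(text)  # replaces &, < and > with &amp;, &lt; and &gt;
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in escaped
    )


print(to_7bit_tmx_text("\u0394 < donn\u00e9es"))
```

Note that Python's built-in "xmlcharrefreplace" error handler emits decimal references, which is why the hexadecimal form is built by hand here.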
Since all XML processors must accept the UTF-8 and UTF-16 encodings and since US-ASCII and UCS-2 encoding methods are, respectively, sub-sets of UTF-8 and UTF-16, a TMX document can omit the encoding declaration in the XML declaration.
Note that UCS-2 files always start with a Unicode byte-order-mark value, the ZERO WIDTH NO-BREAK SPACE 0xFEFF.
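An importer could use the byte-order mark to choose a decoder before handing the data to an XML parser. The sketch below assumes only the three encodings listed above; the function name sniff_encoding is hypothetical, and the returned names are Python codec names, not values defined by TMX.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess the Python codec for raw TMX bytes from the byte-order mark.

    16-bit files start with U+FEFF, whose byte order reveals the
    endianness. Anything else is treated as UTF-8, which also covers
    7-bit US-ASCII files because ASCII is a subset of UTF-8.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "utf-8"


print(sniff_encoding(b"<tmx/>"))
print(sniff_encoding(codecs.BOM_UTF16_BE + "<tmx/>".encode("utf-16-be")))
```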
The following table lists the different elements of a TMX document (Container):
<tmx> | The <tmx> element contains one <header>
element followed by one <body> element.
Mandatory attribute: version. |
<header> | The <header> element contains zero, one or more
<note> elements; zero, one or more
<ude> elements; and zero, one or more
<prop> elements.
Mandatory attributes: creationtool, creationtoolversion, segtype, o-tmf, adminlang, srclang and datatype. Optional attributes: o-encoding, creationdate, creationid, changedate and changeid. |
<prop> | A <prop> (Property) element contains no other elements. The <prop>
elements are used to define the various properties of the parent element
(or of the file when <prop> is used in the <header>
element). These properties are not defined by the standard. Each tool
provider should publish the property types it uses. If the
tool exports unpublished property types, their values should begin
with the prefix "x-".
Mandatory attribute: type. Optional attributes: xml:lang and o-encoding. |
<ude> | A <ude> (User-Defined Encoding)
element contains one or more <map/> elements.
It is used to specify a set of user-defined characters and,
optionally, their mapping from Unicode to the user-defined encoding.
Mandatory attributes: base (if one or more of the <map/> elements contains a code attribute) and name. |
<map/> | A <map/> element is empty (i.e., it
has no content and no end tag). The <map/> element is used to
specify a user-defined character and some of its properties.
Mandatory attribute: unicode. Optional attributes: code, ent and subst. Note that at least one of these attributes should be specified. If the code attribute is specified, the parent <ude> element must specify a base attribute. |
<body> | The <body> element encloses the
main data: the set of <tu> elements
contained in the file.
Mandatory attributes: none. Optional attributes: none. |
<tu> | Each <tu> (Translation Unit) element contains zero, one or
more <note> elements or <prop>
elements, followed by one or more <tuv>
elements. Logically, a complete translation-memory database will contain
at least two <tuv> elements in each
Translation Unit.
Mandatory attributes: none. Optional attributes: tuid, o-encoding, datatype, usagecount, lastusagedate, creationtool, creationtoolversion, creationdate, creationid, changedate, segtype, changeid, o-tmf and srclang. |
<tuv> | Each <tuv> (Translation Unit Variant) element specifies text
in a given language. It contains zero, one or more <note>
elements or <prop> elements, followed by one
<seg> element.
Mandatory attribute: xml:lang. Optional attributes: o-encoding, datatype, usagecount, lastusagedate, creationtool, creationtoolversion, creationdate, creationid, changedate, o-tmf, and changeid. |
<seg> | Each <seg> (Segment) element
contains the text of the <tuv> element. It
contains zero, one or more <bpt> elements; the
same number of corresponding <ept> elements;
zero, one or more <it> elements; zero, one or
more <ph> elements; and zero, one or more
<ut> elements. All spacing characters and
line-breaks are significant inside a <seg> element. It has no
length limitation.
Mandatory attributes: none. Optional attributes: none. |
<note> | A <note> element is used for comments. It contains no other
element.
Mandatory attributes: none. Optional attributes: o-encoding and xml:lang. |
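To make the container structure concrete, here is a minimal Python sketch that builds a document with the mandatory attributes listed in the table. The function name build_tmx and the attribute values ("ExampleTool", "ExampleFormat" and so on) are placeholders invented for illustration, not values defined by the standard.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, srclang="en", tgtlang="fr"):
    """Build a <tmx> tree holding one <tu> per (source, target) pair."""
    tmx = ET.Element("tmx", {"version": "1.3"})
    # All seven mandatory <header> attributes; the values are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "ExampleTool",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "ExampleFormat",
        "adminlang": "en",
        "srclang": srclang,
        "datatype": "PlainText",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, source), (tgtlang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


doc = build_tmx([("menu", "menu")])
print(ET.tostring(doc, encoding="unicode"))
```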
The following table lists the different elements of a TMX document (Content):
<bpt> | The <bpt> (Begin paired tag) element contains zero, one or
more <sub> elements. It is used to delimit the
beginning of a paired sequence of native codes. Each <bpt> has a
corresponding <ept> element within the
segment.
Mandatory attributes: i. Optional attributes: type and x. |
<ept> | The <ept> (End paired tag) element contains zero, one or more
<sub> elements. It is used to delimit the end
of a paired sequence of native codes. Each <ept> element has a
corresponding <bpt> element within the segment.
Mandatory attributes: i. Optional attribute: none. |
<sub> | The <sub> (Sub-flow) element contains zero, one or more
<bpt> elements; the same number of
<ept> elements; zero, one or more
<it> elements; zero, one or more <ut>
elements; zero, one or more <hi> elements; and zero, one or more <ph> elements.
It is used to delimit sub-flow text inside a sequence of native code,
for example: the definition of a footnote or the text of a title in a
HTML anchor element.
Mandatory attributes: none. Optional attributes: datatype and type. |
<it> | The <it> (Isolated tag) element contains zero, one or more
<sub> elements. It is used to delimit a
beginning/ending sequence of native codes that does not have its
corresponding ending/beginning within the segment.
Mandatory attribute: pos. Optional attributes: type and x. |
<ph> | The <ph> (Place holder) element contains zero, one or more
<sub> elements. It is used to delimit a
sequence of native stand-alone codes in the segment.
Mandatory attributes: none. Optional attributes: type, x and assoc. |
<ut> | The <ut> (Unknown tag) element contains no other elements. It
is used to delimit a sequence of native codes, about which the exporter
has no information.
Mandatory attributes: none. Optional attribute: x. |
<hi> | The <hi> (Highlight) element contains zero, one or more
<bpt> elements; the same number of
<ept> elements; zero, one or more
<it> elements; zero, one or more <ut>
elements; zero, one or more <hi> elements; and zero, one or more <ph> elements.
It is used to delimit a portion of the segment for any user-defined
purpose.
Mandatory attributes: none. Optional attributes: type and x. Version: 1.2 and after. |
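The relationship between segment text, native codes and sub-flows can be sketched in Python. The sample segment below is invented for illustration (an HTML anchor whose title is sub-flow text), and the function name split_segment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements whose direct text is native code rather than segment text.
CODE_ELEMENTS = {"bpt", "ept", "it", "ph", "ut"}

# Invented sample: the title of an HTML anchor is marked up as a sub-flow.
SAMPLE = ('<seg>See <bpt i="1" type="link">&lt;a title="<sub>Home page</sub>"&gt;'
          '</bpt>here<ept i="1">&lt;/a&gt;</ept>.</seg>')


def split_segment(seg):
    """Return (main_text, subflows) for a <seg> element."""
    subflows = []

    def text_of(node):
        inside_code = node.tag in CODE_ELEMENTS
        parts = [] if inside_code else [node.text or ""]
        for child in node:
            if inside_code:
                # Within native code, only <sub> carries text.
                if child.tag == "sub":
                    subflows.append(text_of(child))
            else:
                parts.append(text_of(child))
                parts.append(child.tail or "")
        return "".join(parts)

    return text_of(seg), subflows


main_text, subflows = split_segment(ET.fromstring(SAMPLE))
print(main_text)
print(subflows)
```

Text inside a <hi> element is kept as part of the main text, since <hi> delimits a portion of the segment rather than a native code.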
The following table lists the different attributes used in the elements of a TMX document. The same attribute may be used with multiple elements, but will be either mandatory or optional depending on the specific occurrence.
adminlang | The adminlang attribute is used in the <header> element to specify the default language for the administrative and informative elements <note> and <prop>. Its value must be one of the values used by an xml:lang attribute. |
assoc | The assoc attribute (Association) is used to define whether a <ph> element is associated with the previous or the following text. Its value must be "p" (previous), "f" (following), or "b" (both). |
base | The base attribute specifies the code set upon which the re-mapping of the <ude> element is based. Its value should follow the same rules as the value of an o-encoding attribute. |
changedate | The changedate attribute specifies the date of the modification of the element. Its value must be in ASCII, in the format YYYYMMDDThhmmssZ (e.g. 19970811T133402Z for August 11th 1997 at 1:34:02 pm). This is one of the options described in ISO 8601:1988. The value is always given in UTC (as indicated by the terminal Z). |
changeid | The changeid attribute specifies the user who modified the element. |
code | The code attribute specifies the code-point value in a user-defined encoding corresponding to the unicode character of a given <map/> element. Its value must be in hexadecimal format (e.g., code="#x9F"). |
creationdate | The creationdate attribute specifies the date of the creation of the element. Its value must be in ASCII, in the format YYYYMMDDThhmmssZ (e.g. 19970811T133402Z for August 11th 1997 at 1:34:02 pm). This is one of the options described in ISO 8601:1988. The value is always given in UTC (as indicated by the terminal Z). |
creationid | The creationid attribute specifies the user who created the element. |
creationtool | The creationtool attribute identifies the tool that created the TMX document. Its possible values are not specified by the standard but each tool provider will publish the string identifier it uses. |
creationtoolversion | The creationtoolversion attribute identifies the version of the tool that created the TMX document. Its possible values are not specified by the standard but each tool provider will publish the string identifier it uses. |
datatype | The datatype attribute specifies the type of data contained in an element. Its default value is "unknown". See the recommended values section for more information. |
ent | The ent attribute specifies the entity name of the character defined by a given <map/> element. Its value must be in ASCII (e.g., ent="copy"). |
i | The i attribute (Internal matching) is used in the content markup
to pair the <bpt> elements with <ept>
elements. This mechanism provides TMX with support to markup a possibly
overlapping range of codes, such as: "<B>Bold <I>Bold+Italic</B>
Italics</I>" . |
lang | DEPRECATED attribute since version 1.3 : use xml:lang instead. The lang attribute specifies the language or the locale of the data of the element. In the <note> and <prop> elements, the default value for the lang attribute is the same as the adminlang attribute in the <header> element. The value of the lang attribute must be one of the ISO language identifiers (2 or 3-letter code) or one of the standard locale identifiers (2 or 3-letter language code, dash, 2-letter region code). |
xml:lang | The xml:lang attribute specifies the language or the locale of the data
of the element. In the <note> and
<prop> elements, the default value for the
xml:lang attribute is the same as the
adminlang
attribute in the <header> element. The
value of the xml:lang attribute must be one of the values defined by the XML
specifications for this attribute. Note that the xml:lang value is case
insensitive. Starting from TMX version 1.3 this attribute replaces the deprecated attribute lang. TMX applications supporting version 1.3 should always use xml:lang for output, but should interpret the lang attribute as xml:lang in input. If, by accident, both attributes are present for a given element and have different values, xml:lang takes precedence. |
lastusagedate | The lastusagedate attribute specifies the last time the content of a <tu> or <tuv> element was used in the original translation memory environment. Its value must be in ASCII, in the format YYYYMMDDThhmmssZ (e.g. 19970811T133402Z for August 11th 1997 at 1:34:02 pm). This is one of the options described in ISO 8601:1988. The value is always given in UTC (as indicated by the terminal Z). |
name | The name attribute specifies the name of a <ude> element. Its value is not defined by the standard, but tools providers will publish the values they use. |
o-encoding | As stated in Section 2.2, all TMX files are in Unicode. However, it is sometimes useful to know what code set was used to encode text that was converted to Unicode for purposes of interchange. The o-encoding attribute specifies the original or preferred code set of the data of the element in case it is to be re-encoded in a non-Unicode code set. Its value, when possible, should be one of the IANA recommended code set identifiers. |
o-tmf | The o-tmf (Original Translation Memory Format) attribute specifies the format of the Translation Memory file from which the TMX document or segment thereof has been generated. |
pos | The pos attribute (Position) specifies that an <it> element is actually the beginning or the end part of a paired code that has no correspondence in the segment. Its value must be an empty string, "begin" or "end". |
segtype | The segtype attribute specifies the kind of segmentation used in
the <tu> element. Its value must be one of "block",
"paragraph", "sentence" or "phrase". If a <tu>
element does not have a segtype attribute specified, it is of the type
defined in the <header> element.
See the Implementation Notes for examples of how to use segtype. |
srclang | The srclang attribute specifies the language or locale of the source text. Its value must be one of the values used by an xml:lang attribute or the value "*all*" to indicate that any language combination can be used. |
subst | The subst attribute specifies an alternative string for the character defined in a given <map/> element. Its value must be in ASCII (e.g., "(c)" for the copyright sign). |
tuid | The tuid attribute specifies an identifier for the <tu> element. Its value is not defined by the standard (it could be unique or not, numeric or alphanumeric, etc.). |
type | The type attribute specifies the kind of data a <prop>, <bpt>, <ph>, <hi>, <sub> or <it> element represents. See the recommended values section for more information. |
unicode | The unicode attribute specifies the Unicode character value of a <map/> element. Its value must be a valid Unicode value (including the Private Use area) in hexadecimal format (e.g., unicode="#xF8FF"). |
usagecount | The usagecount attribute specifies the number of times the <tu> or the content of the <tuv> element has been accessed in the original TM environment. |
version | The version attribute indicates the version of the TMX format to which the document conforms. |
x | The x attribute (External matching) is used in the content markup to match <bpt>, <ph>, <it>, <ut> and <hi> elements between each <tuv> element of a given <tu> element. This mechanism facilitates the pairing of allied codes in source and target text, even if the order of code occurrence differs between the two. |
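The date format shared by the changedate, creationdate and lastusagedate attributes maps directly onto a strptime pattern. The following Python sketch shows one way to read and write it; the helper names parse_tmx_date and format_tmx_date are hypothetical.

```python
from datetime import datetime, timezone

# YYYYMMDDThhmmssZ, always in UTC (the terminal Z).
TMX_DATE_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_tmx_date(value):
    """Parse a TMX date attribute into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TMX_DATE_FORMAT).replace(tzinfo=timezone.utc)


def format_tmx_date(moment):
    """Format a timezone-aware datetime as a TMX date attribute value."""
    # Assumes `moment` is timezone-aware; it is converted to UTC first.
    return moment.astimezone(timezone.utc).strftime(TMX_DATE_FORMAT)


parsed = parse_tmx_date("19970811T133402Z")
print(parsed.isoformat())
print(format_tmx_date(parsed))
```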
By using standard values for attributes, the TMX format can minimize the amount of data lost during the exchange process. However, the dynamic nature of the diverse collection of data that needs to be captured doesn't lend itself to being part of the TMX format specification. By specifying the recommended values for attributes in the accompanying Implementation Notes, developers of translation memory tools can update this information on an on-going basis without initiating a revision of the TMX format specification itself.
The TMX specification strongly recommends that developers of TMX-aware tools use only recommended attribute values when writing TMX data in order to ensure full TMX compliance. The Implementation Notes document specifies recommended values for the datatype and type attributes.
Each TM system uses a different method of marking up the formatting. Formats are constantly evolving, and new formats will be introduced on a regular basis. Attempting to collect, interpret, disseminate and maintain finite descriptions of each formatting tag used at any given time by any of the TM systems is not possible.
The best way to deal with these native codes is to delimit them by a specific set of elements that convey where they begin and end, and possibly additional information about what they are (bold, italic, footnote, etc.).
Native codes can be grouped into four categories:
Respectively, the TMX vocabulary provides elements to mark up each category of native code sequences:
An additional element (<sub>) is provided to delimit sub-flow text within a sequence of native codes. For example, if the text content of a footnote is defined within the footnote marker code, it may be demarked with the <sub> element.
For example:
Without content mark-up tags:
<seg>Text in {\i italics}.</seg>
With content mark-up tags:
<seg>Text in <bpt type="italic">{\i </bpt>italics<ept>}</ept>.</seg>
Such a mechanism allows tools to perform matching at several levels:
For example, here are four segments differing only by the formatting codes:
Plain text: Special text
RTF v1: {\b Special} text
RTF v2: {\cf7 Special} text
HTML: <B>Special</B> text
The same samples with the TMX content mark-up tags:
Plain text: <seg>Special text</seg>
RTF v1: <seg><bpt type="bold">{\b </bpt>Special<ept>}</ept> text</seg>
RTF v2: <seg><bpt>{\cf7 </bpt>Special<ept>}</ept> text</seg>
HTML: <seg><bpt type="bold">&lt;B&gt;</bpt>Special<ept>&lt;/B&gt;</ept> text</seg>
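Matching on text alone, ignoring the native codes, can be sketched in Python as follows. The four samples are the marked-up segments above, with the HTML codes escaped so that each segment is well-formed XML; the function name plain_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements that wrap native codes; their content is skipped when matching text.
CODE_ELEMENTS = {"bpt", "ept", "it", "ph", "ut"}

SAMPLES = {
    "plain": '<seg>Special text</seg>',
    "rtf_v1": r'<seg><bpt type="bold">{\b </bpt>Special<ept>}</ept> text</seg>',
    "rtf_v2": r'<seg><bpt>{\cf7 </bpt>Special<ept>}</ept> text</seg>',
    "html": '<seg><bpt type="bold">&lt;B&gt;</bpt>Special<ept>&lt;/B&gt;</ept> text</seg>',
}


def plain_text(node):
    """Return the text of a segment with all native codes removed."""
    parts = [node.text or ""]
    for child in node:
        if child.tag not in CODE_ELEMENTS:  # e.g. <hi>, which wraps real text
            parts.append(plain_text(child))
        parts.append(child.tail or "")  # text following the child
    return "".join(parts)


texts = {name: plain_text(ET.fromstring(xml)) for name, xml in SAMPLES.items()}
for name, text in texts.items():
    print(f"{name}: {text!r}")
```

All four segments reduce to the same string, so a tool that cannot interpret the codes can still offer a text-level match; a tool that understands the type attribute could additionally tell that RTF v1 and HTML are both bold.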
The datatype attribute is used to specify the kind of native code the data contains.
TMX implements a mechanism to help you match codes between source and target text. The x attribute in the <bpt>, <it>, <ut>, <ph> and <hi> elements allows you to pair codes between two <tuv> elements, even if they are no longer in the same order because the syntax of the translation differs. For example:
<seg>The <bpt x="1">{\b </bpt> black<ept>}</ept><bpt x="2">{\i </bpt> cat<ept>}</ept> sleeps.</seg>
<seg>Le<bpt x="2">{\i </bpt> chat<ept>}</ept> <bpt x="1">{\b </bpt> noir<ept>}</ept> dort.</seg>
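The pairing by x can be sketched in Python using the two segments above. The function names codes_by_x and match_codes are hypothetical, and the sketch reads only the direct text of each code element, ignoring any sub-flow content.

```python
import xml.etree.ElementTree as ET

SOURCE = ET.fromstring(
    r'<seg>The <bpt x="1">{\b </bpt> black<ept>}</ept>'
    r'<bpt x="2">{\i </bpt> cat<ept>}</ept> sleeps.</seg>'
)
TARGET = ET.fromstring(
    r'<seg>Le<bpt x="2">{\i </bpt> chat<ept>}</ept> '
    r'<bpt x="1">{\b </bpt> noir<ept>}</ept> dort.</seg>'
)


def codes_by_x(seg):
    """Map each x value in a segment to the native code it labels."""
    return {
        el.get("x"): el.text or ""
        for el in seg.iter()
        if el.get("x") is not None
    }


def match_codes(source, target):
    """Pair codes that carry the same x value in both segments."""
    src, tgt = codes_by_x(source), codes_by_x(target)
    return {x: (src[x], tgt[x]) for x in src if x in tgt}


matches = match_codes(SOURCE, TARGET)
for x, (src_code, tgt_code) in sorted(matches.items()):
    print(x, repr(src_code), repr(tgt_code))
```

The codes appear in the order 1, 2 in the source and 2, 1 in the target, yet each pair is still recovered because the lookup is keyed on x rather than on position.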
TMX provides a way to deal with overlapping tags. Such constructions are not used often; however, several formats allow them. For example, the following HTML segment, even if not strictly legal, is accepted by some HTML editors and usually interpreted correctly by browsers.
HTML: <B>Bold, <I>Bold+Italic</B>, Italic</I>
TMX (without content mark-up): <seg>&lt;B&gt;Bold, &lt;I&gt;Bold+Italic&lt;/B&gt;, Italic&lt;/I&gt;</seg>
With the TMX content mark-up, since the <ept> element does not necessarily have a type, it can be difficult to know which sequence of codes it closes as illustrated by the following segment:
TMX (with basic content mark-up): <seg><bpt>&lt;B&gt;</bpt>Bold, <bpt>&lt;I&gt;</bpt>Bold+Italic<ept>&lt;/B&gt;</ept>, Italic<ept>&lt;/I&gt;</ept></seg>
The attribute i is used to specify which <ept> is closing which <bpt>.
TMX (with correct content mark-up): <seg><bpt i="1">&lt;B&gt;</bpt>Bold, <bpt i="2">&lt;I&gt;</bpt>Bold+Italic<ept i="1">&lt;/B&gt;</ept>, Italic<ept i="2">&lt;/I&gt;</ept></seg>
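Resolving overlapping pairs through the i attribute can be sketched in Python. The segment is the correctly marked-up example above, with the HTML codes escaped so that it parses as XML; the function name pair_codes and its error handling are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

SEGMENT = ET.fromstring(
    '<seg><bpt i="1">&lt;B&gt;</bpt>Bold, <bpt i="2">&lt;I&gt;</bpt>Bold+Italic'
    '<ept i="1">&lt;/B&gt;</ept>, Italic<ept i="2">&lt;/I&gt;</ept></seg>'
)


def pair_codes(seg):
    """Return (i, opening code, closing code) for each <bpt>/<ept> pair.

    Pairs are matched on the i attribute, so they may overlap freely.
    """
    opened = {}
    pairs = []
    for el in seg.iter():  # document order
        if el.tag == "bpt":
            opened[el.get("i")] = el.text or ""
        elif el.tag == "ept":
            key = el.get("i")
            if key not in opened:
                raise ValueError(f"<ept i={key!r}> has no matching <bpt>")
            pairs.append((key, opened.pop(key), el.text or ""))
    if opened:
        raise ValueError(f"unclosed <bpt> elements: {sorted(opened)}")
    return pairs


pairs = pair_codes(SEGMENT)
for i, begin, end in pairs:
    print(i, begin, end)
```

A stack-based matcher would pair </B> with <I> here, because <I> is the most recently opened code; keying on i avoids that mistake.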
See the Implementation Notes for more details.
Notational conventions: The restrictions on the number of occurrences of each element and whether an attribute is mandatory within an element are indicated by:
This is an example of a TMX file. (The indentations are only there for ease of reading). Different types of notation are mixed to illustrate the various possibilities.
<?xml version="1.0" ?>
<!DOCTYPE tmx SYSTEM "tmx13.dtd">
<!-- Example of TMX document -->
<tmx version="1.3">
 <header
  creationtool="XYZTool"
  creationtoolversion="1.01-023"
  datatype="PlainText"
  segtype="sentence"
  adminlang="en-us"
  srclang="EN"
  o-tmf="ABCTransMem"
  creationdate="19970101T163812Z"
  creationid="ThomasJ"
  changedate="19970314T023401Z"
  changeid="Amity"
  o-encoding="iso-8859-1"
 >
  <note>This is a note at document level.</note>
  <prop type="RTFPreamble">{\rtf1\ansi\tag etc...{\fonttbl}</prop>
  <ude name="MacRoman" base="Macintosh">
   <map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>
  </ude>
 </header>
 <body>
  <tu
   tuid="0001"
   datatype="Text"
   usagecount="2"
   lastusagedate="19970314T023401Z"
  >
   <note>Text of a note at the TU level.</note>
   <prop type="x-Domain">Computing</prop>
   <prop type="x-Project">Pægasus</prop>
   <tuv
    xml:lang="EN"
    creationdate="19970212T153400Z"
    creationid="BobW"
   >
    <seg>data (with a non-standard character: &#xF8FF;).</seg>
   </tuv>
   <tuv
    xml:lang="FR-CA"
    creationdate="19970309T021145Z"
    creationid="BobW"
    changedate="19970314T023401Z"
    changeid="ManonD"
   >
    <prop type="Origin">MT</prop>
    <seg>données (avec un caractère non standard: &#xF8FF;).</seg>
   </tuv>
  </tu>
  <tu tuid="0002" srclang="*all*">
   <prop type="Domain">Cooking</prop>
   <tuv xml:lang="EN">
    <seg>menu</seg>
   </tuv>
   <tuv xml:lang="FR-CA">
    <seg>menu</seg>
   </tuv>
   <tuv xml:lang="FR-FR">
    <seg>menu</seg>
   </tuv>
  </tu>
 </body>
</tmx>
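An import routine for a file like this one might look roughly like the Python sketch below. It works on a trimmed copy of the example; the function names hex_value and read_units are hypothetical, and lower-casing the language codes is a design choice based on xml:lang values being case insensitive.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the example document.
SAMPLE = """<?xml version="1.0" ?>
<tmx version="1.3">
 <header creationtool="XYZTool" creationtoolversion="1.01-023"
  datatype="PlainText" segtype="sentence" adminlang="en-us"
  srclang="EN" o-tmf="ABCTransMem">
  <ude name="MacRoman" base="Macintosh">
   <map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>
  </ude>
 </header>
 <body>
  <tu tuid="0002" srclang="*all*">
   <tuv xml:lang="EN"><seg>menu</seg></tuv>
   <tuv xml:lang="FR-CA"><seg>menu</seg></tuv>
   <tuv xml:lang="FR-FR"><seg>menu</seg></tuv>
  </tu>
 </body>
</tmx>"""


def hex_value(value):
    """Convert a '#x...' attribute value (unicode or code) to an integer."""
    if not value.startswith("#x"):
        raise ValueError(f"expected a '#x' prefix: {value!r}")
    return int(value[2:], 16)


def read_units(root):
    """Map each tuid to a {language: segment text} dictionary."""
    units = {}
    for tu in root.iter("tu"):
        units[tu.get("tuid")] = {
            # itertext() also returns native-code text; the sample has none.
            tuv.get(XML_LANG).lower(): "".join(tuv.find("seg").itertext())
            for tuv in tu.findall("tuv")
        }
    return units


root = ET.fromstring(SAMPLE)
units = read_units(root)
user_chars = {hex_value(m.get("unicode")): m.get("subst") for m in root.iter("map")}
print(units)
print(user_chars)
```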
The following links are useful references for implementing TMX, SGML and XML-related applications.
The most up-to-date version of this document can be obtained on the LISA Web site at http://www.lisa.org/tmx/.