[This local archive copy mirrored from: http://www.lisa.org/tmx/tmx.htm; see the canonical version of the document.]

TMX Format

LISA SIG

Specifications

Version 1.0 - Nov-25-1997 - Last edit: Nov-25-1997


Summary of purpose

This document describes the TMX file format. TMX stands for Translation Memory eXchange. OSCAR (Open Standards for Container/Content Allowing Re-use) is the LISA Special Interest Group responsible for its definition.

1. Overview

The purpose of TMX is to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process.

TMX is defined in two parts:

For the time being, this document addresses only the first part.


2. Specifications

2.1. SGML/XML Compliance

TMX is XML-compliant (and therefore SGML-compliant as well). It also uses various ISO standards for date/time, language codes, and country codes. (See References section.)

TMX files are intended to be created automatically by export routines and processed automatically by import routines. TMX files are "well-formed" XML documents that can be processed without explicit reference to the TMX DTD. However, a "valid" TMX file must conform to the TMX DTD, and any suspect TMX file should be validated against the TMX DTD using a general-purpose XML parser.
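
The difference between "well-formed" and "valid" can be illustrated with a minimal sketch. It assumes Python and its standard-library xml.etree.ElementTree parser, which checks well-formedness only and never consults a DTD; the function name is_well_formed is hypothetical and not part of TMX.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if `document` parses as well-formed XML.

    No DTD is consulted, so a True result says nothing about validity.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Well-formed, but not valid TMX: the <HEADER> element is missing.
well_formed_only = '<TMX VERSION="1.0"><BODY/></TMX>'
# Not well-formed: the <HEADER> start tag is closed by a </BODY> end tag.
broken = '<TMX VERSION="1.0"><HEADER></BODY></TMX>'

print(is_well_formed(well_formed_only), is_well_formed(broken))  # True False
```

Checking validity would additionally require a validating parser and the TMX DTD, which this sketch does not attempt.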

Well-formed XML documents may start with the XML declaration, but it is not required.

2.2. Code Sets

TMX files are always in a Unicode code set. They can use either of two encodings: UCS-2 (16-bit files) or ISO-646 [ASCII] (7-bit files). In both cases only the following five character entities are allowed: &amp; (&), &lt; (<), &gt; (>), &apos; ('), and &quot; ("). For 7-bit files, extended (non-ASCII) characters are represented by numeric character references using the Unicode hexadecimal values (e.g. &#x0394; for the Greek capital letter delta). Since all XML files are, by default, in the UTF-8 encoding of Unicode, TMX files should, in order to ensure correct XML parsing, begin with an XML processing instruction containing an explicit value for the encoding attribute (UCS-2 or ISO-646).
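
As a rough illustration of the 7-bit rule, the following Python sketch escapes a string using only the five permitted entities and hexadecimal character references. The function name to_seven_bit and the choice of at least four hexadecimal digits are assumptions made for this example, not requirements of the format; escaping all five special characters everywhere is simply the conservative option.

```python
# The five permitted character entities, keyed by the character they replace.
ENTITIES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    '"': "&quot;",
}


def to_seven_bit(text: str) -> str:
    """Escape `text` so that the result contains only 7-bit ASCII characters."""
    out = []
    for ch in text:
        if ch in ENTITIES:
            out.append(ENTITIES[ch])
        elif ord(ch) > 0x7F:
            # Non-ASCII character: hexadecimal numeric character reference.
            out.append(f"&#x{ord(ch):04X};")
        else:
            out.append(ch)
    return "".join(out)


print(to_seven_bit("donn\u00e9es <b>"))  # donn&#x00E9;es &lt;b&gt;
```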

UCS-2 files always start with a Unicode byte-order-mark value, the ZERO WIDTH NO-BREAK SPACE 0xFEFF. Like other XML files, TMX files can contain Unicode surrogates to access some of the ISO-10646 code set values outside the Basic Multilingual Plane.
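
The byte-order mark and the use of surrogates can be seen in a short Python sketch. It assumes Python's "utf-16-be" codec as a stand-in for 16-bit output (the codec emits surrogate pairs for characters outside the Basic Multilingual Plane); the function names encode_16bit and byte_order are hypothetical.

```python
import codecs


def encode_16bit(text: str) -> bytes:
    """Encode `text` as big-endian 16-bit code units, preceded by 0xFEFF."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def byte_order(data: bytes) -> str:
    """Report the byte order announced by the leading byte-order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown"


# "A" followed by U+1D11E, a character outside the Basic Multilingual Plane.
data = encode_16bit("A\U0001D11E")
print(data.hex())        # feff0041d834dd1e
print(byte_order(data))  # big-endian
```

In the output, feff is the byte-order mark, 0041 is "A", and d834 dd1e is the surrogate pair for U+1D11E.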

2.3. Element Definitions

The following table lists the different elements of a TMX document.

<TMX> The <TMX> element contains one <HEADER> element followed by one <BODY> element.
It has one mandatory attribute: VERSION.
<HEADER> The <HEADER> element contains zero, one or more <META/> elements; zero, one or more <NOTE> elements; zero, one or more <UDE> elements; and zero, one or more <PROP> elements.
It has four mandatory attributes: CREATIONTOOL, SEGTYPE, O-TMF and DATATYPE. It has several optional attributes: O-ENCODING, CREATIONDATE, CREATIONID, CHANGEDATE, CHANGEID, ADMINLANG and SRCLANG.
<META/> A <META/> element is empty (i.e., it has no content and no end tag).
It does have, however, two mandatory attributes: NAME and REF and two optional attributes: LANG and O-ENCODING.
Each <META/> element specifies tool/format-related data that are not defined by the standard and that are stored in an external location.
<PROP> A <PROP> (Property) element contains no other elements.
It has one mandatory attribute: NAME and two optional attributes: LANG and O-ENCODING.
The <PROP> elements are used to define the various properties of the parent element (or the file when <PROP> is used in a <HEADER> element). For example: Domain, Projects, Status, RTFPreamble, etc. These properties are not defined by the standard. Each tool provider should publish the different properties it uses. Some other properties are user-defined.
<UDE> A <UDE> (User-Defined Encoding) element contains one or more <MAP> elements.
It has one mandatory attribute: NAME.
It is used to specify a set of user-defined characters and, optionally, their mapping from Unicode to the user-defined encoding.
<MAP/> A <MAP/> element is empty (i.e., it has no content and no end tag).
It has one mandatory attribute: UNICODE and several optional attributes: CODE, ENT and SUBST.
It is used to specify a user-defined character and some of its properties.
<BODY> The <BODY> element encloses the main data: the set of <TU> elements that make up the file.
It has no attributes.
<TU> Each <TU> (Translation Unit) element contains zero, one or more <NOTE> elements, followed by zero, one or more <PROP> elements, followed by one or more <TUV> elements. The first <TUV> element in a <TU> is expected to be the source.
Logically, a translation-memory database is not complete unless there are at least two <TUV> elements in most Translation Units.
Each <TU> has several optional attributes: ID, O-ENCODING, DATATYPE, USAGECOUNT, LASTUSAGEDATE, CREATIONTOOL, CREATIONDATE, CREATIONID, CHANGEDATE, SEGTYPE, CHANGEID and SRCLANG.
<TUV> Each <TUV> (Translation Unit Variant) specifies a text in a given language. It contains zero, one or more <NOTE> elements, followed by zero, one or more <PROP> elements, followed by one <SEG> element.
It has one mandatory attribute: LANG and several optional attributes: O-ENCODING, DATATYPE, USAGECOUNT, LASTUSAGEDATE, CREATIONTOOL, CREATIONDATE, CREATIONID, CHANGEDATE and CHANGEID.
<SEG> Each <SEG> (Segment) contains the text of the <TUV>.
It has no attributes.
All spacing characters and line-breaks are significant inside a <SEG> element.
<NOTE> A <NOTE> element is used for comments. It contains no other element.
It has two optional attributes: O-ENCODING and LANG.
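
Some of the containment and mandatory-attribute rules in the table above can be checked mechanically. The sketch below assumes Python's xml.etree.ElementTree; the function name structure_errors is hypothetical, and the check is deliberately partial (it does not verify the order of <NOTE>, <PROP> and <TUV> children, nor the <UDE> and <MAP/> rules). It is no substitute for validation against the DTD.

```python
import xml.etree.ElementTree as ET


def structure_errors(root):
    """Return a list of violations of a few of the rules in the element table."""
    if root.tag != "TMX":
        return [f"root element is <{root.tag}>, expected <TMX>"]
    errors = []
    if "VERSION" not in root.attrib:
        errors.append("<TMX> lacks the mandatory VERSION attribute")
    if [child.tag for child in root] != ["HEADER", "BODY"]:
        errors.append("<TMX> must contain one <HEADER> followed by one <BODY>")
        return errors
    header, body = list(root)
    for name in ("CREATIONTOOL", "SEGTYPE", "O-TMF", "DATATYPE"):
        if name not in header.attrib:
            errors.append(f"<HEADER> lacks the mandatory {name} attribute")
    for number, tu in enumerate(body.findall("TU"), start=1):
        tuvs = tu.findall("TUV")
        if not tuvs:
            errors.append(f"<TU> #{number} has no <TUV>")
        for tuv in tuvs:
            if "LANG" not in tuv.attrib:
                errors.append(f"<TUV> in <TU> #{number} lacks LANG")
            if len(tuv.findall("SEG")) != 1:
                errors.append(f"<TUV> in <TU> #{number} must contain one <SEG>")
    return errors


doc = ET.fromstring(
    '<TMX VERSION="1.0">'
    '<HEADER CREATIONTOOL="XYZTool" SEGTYPE="sentence"'
    ' O-TMF="ABCTransMem" DATATYPE="Text"/>'
    '<BODY><TU>'
    '<TUV LANG="EN"><SEG>menu</SEG></TUV>'
    '<TUV><SEG>menu</SEG></TUV>'  # LANG deliberately omitted
    '</TU></BODY>'
    '</TMX>'
)
print(structure_errors(doc))  # ['<TUV> in <TU> #1 lacks LANG']
```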

2.4. Attribute Definitions

The following table lists the different attributes used in the elements of a TMX document. The same attribute may be mandatory or optional depending on the element.

Note that a few attributes are not yet completely defined.

CREATIONTOOL The CREATIONTOOL attribute identifies the tool that created the TMX document. Its possible values are not specified by the standard but each tool provider will publish the string identifier it uses.
CREATIONDATE The CREATIONDATE attribute specifies the date of the creation of the element. Its value must be in ASCII, in the format YYYYMMDDThhmmss. (e.g. 19970811T133402 for August 11th 1997 at 1:34pm 2 seconds.) Time is given in UTC.
CREATIONID The CREATIONID attribute specifies the user who created the element.
CHANGEDATE The CHANGEDATE attribute specifies the date of the modification of the element. Its value must be in ASCII, in the format YYYYMMDDThhmmss. (e.g. 19970811T133402 for August 11th 1997 at 1:34pm 2 seconds.) Time is given in UTC.
CHANGEID The CHANGEID attribute specifies the user who modified the element.
O-ENCODING As stated in Section 2.2, all TMX files are in Unicode. However, it is sometimes useful to know what codeset was used to encode text that was converted to Unicode for purposes of interchange. The O-ENCODING attribute specifies the original or preferred code set of the data of the element in case they are to be encoded in a non-Unicode code set. Its value, when possible, should be one of the SGML/HTML recommended code set identifiers.
O-TMF The O-TMF (Original Translation Memory Format) attribute specifies the format of the Translation Memory file from which the TMX document has been generated.
LANG The LANG attribute specifies the language or the locale of the data of the element. In the <NOTE>, <META/> and <PROP> elements, the default value for the LANG attribute is the same as the ADMINLANG attribute in the <HEADER> element. The value of the LANG attribute must be one of the ISO language identifiers (2 or 3-letter code) or one of the standard locale identifiers (2 or 3-letter language code, dash, 2-letter region code).
DATATYPE The DATATYPE attribute specifies the type of data of an element. Its possible values are not yet part of the standard.
SRCLANG The SRCLANG attribute specifies the language or locale of the source text. Its value must be one of the values used by a LANG attribute.
ADMINLANG The ADMINLANG attribute is used in the <HEADER> element to specify the default language for the administrative and informative elements <NOTE>, <META/> and <PROP>.
NAME The NAME attribute specifies the type of information of a <META/> or <PROP> element, or the name of a <UDE> element. Its value is not defined by the standard, but tool providers will publish the values they use.
REF The REF attribute is used to specify the external reference document of the <META/> element. The type of document is specified by the NAME attribute.
ID The ID attribute specifies an identifier for the <TU> element. Its value is not defined by the standard (it could be unique or not, numeric or alphanumeric, etc.).
USAGECOUNT The USAGECOUNT attribute specifies the number of times a <TU> or <TUV> has been used.
LASTUSAGEDATE The LASTUSAGEDATE attribute specifies the last time a <TU> or <TUV> was used. Its value must be in ASCII, in the format YYYYMMDDThhmmss. (e.g. 19970811T133402 for August 11th 1997 at 1:34pm 2 seconds.) Time is given in UTC.
VERSION The VERSION attribute indicates the version of the TMX format to which the document conforms.
UNICODE The UNICODE attribute specifies the Unicode character value of a <MAP/> element. Its value must be a valid Unicode value (including the Private Use area) in hexadecimal format (e.g. UNICODE="#xF8FF").
CODE The CODE attribute specifies the code-point value in a user-defined encoding corresponding to the UNICODE character of a given <MAP/> element. Its value must be in hexadecimal format (e.g. CODE="#x9F").
ENT The ENT attribute specifies the entity name of the character defined by a given <MAP/> element. Its value must be in ASCII.
SUBST The SUBST attribute lets you specify an alternative string for the character defined in a given <MAP/> element. Its value must be in ASCII (e.g. "(c)" for the copyright sign).
SEGTYPE The SEGTYPE attribute specifies the kind of segmentation used in the <TU> elements. Its value must be either "paragraph", "sentence" or "phrase". If a <TU> does not have a SEGTYPE attribute specified, it is of the type defined in the <HEADER> element.
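
Several of the value formats in the table above lend themselves to small parsers. The following Python sketch is one possible reading of those rules, not a normative implementation. All function names are hypothetical, and the LANG check tests only the shape of the value, not whether the codes are registered ISO identifiers.

```python
import re
from datetime import datetime, timezone

LANG_PATTERN = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{2})?")
SEGTYPES = ("paragraph", "sentence", "phrase")


def parse_tmx_date(value):
    """Parse a YYYYMMDDThhmmss value; the result is marked as UTC."""
    if not re.fullmatch(r"\d{8}T\d{6}", value):
        raise ValueError(f"not in YYYYMMDDThhmmss format: {value!r}")
    parsed = datetime.strptime(value, "%Y%m%dT%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def is_lang(value):
    """Check the shape of a LANG value: xx, xxx, xx-YY or xxx-YY."""
    return LANG_PATTERN.fullmatch(value) is not None


def parse_hex(value):
    """Convert a UNICODE or CODE value such as "#xF8FF" to an integer."""
    if not value.startswith("#x"):
        raise ValueError(f"expected a value of the form #xHHHH, got {value!r}")
    return int(value[2:], 16)


def effective_segtype(tu_attrs, header_attrs):
    """Return the <TU>'s own SEGTYPE, falling back to the <HEADER> value."""
    segtype = tu_attrs.get("SEGTYPE", header_attrs["SEGTYPE"])
    if segtype not in SEGTYPES:
        raise ValueError(f"unknown SEGTYPE {segtype!r}")
    return segtype


print(parse_tmx_date("19970811T133402").isoformat())  # 1997-08-11T13:34:02+00:00
print(is_lang("FR-CA"), is_lang("french"))            # True False
print(hex(parse_hex("#xF8FF")))                        # 0xf8ff
print(effective_segtype({}, {"SEGTYPE": "sentence"}))  # sentence
```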

2.5. Document Type Definition

The DTD for TMX is described in the tmx.dtd document.


3. Sample

Notational conventions: The restrictions on the number of occurrences of each element and whether an attribute is mandatory within an element are indicated by:

This is an example of a TMX file. (The indentation is there only for ease of reading.) Different types of notation are mixed to illustrate the various possibilities.

<?XML VERSION="1.0" ENCODING="ISO-646" ?>
<TMX VERSION="1.0" >
   <HEADER
      CREATIONTOOL="XYZTool v1.01-023"
      DATATYPE="Text"
      SEGTYPE="sentence"
      O-TMF="ABCTransMem"
      CREATIONDATE="19970101T163812"
      CREATIONID="ThomasJ"
      CHANGEDATE="19970314T023401"
      CHANGEID="AlbertA"
      ADMINLANG="EN"
      SRCLANG="EN"
   >
      <NOTE>This is a note at document level.</NOTE>
      <META
         NAME="ExternalData"
         REF="data.txt"
      />
      <PROP NAME="RTFPreamble">{\rtf1\ansi\tag etc...{\fonttbl}</PROP>
      <UDE NAME="MacRoman">
         <MAP UNICODE="#xF8FF" CODE="#xF0" ENT="Apple_logo" SUBST="[Apple]"/>
      </UDE>
   </HEADER>

   <BODY>
      <TU
         ID="0001"
         DATATYPE="Text"
         USAGECOUNT="2"
         LASTUSAGEDATE="19970314T023401"
      >
         <NOTE>Text of a note at the TU level.</NOTE>
         <PROP NAME="Domain">Computing</PROP>
         <PROP NAME="Project">P&#x00E6;gasus</PROP>
         <TUV
            LANG="EN"
            CREATIONDATE="19970212T153400"
            CREATIONID="BobW"
         >
            <SEG>data (with a non-standard character: &#xF8FF;).</SEG>
         </TUV>
         <TUV
            LANG="FR-CA"
            CREATIONDATE="19970309T021145"
            CREATIONID="BobW"
            CHANGEDATE="19970314T023401"
            CHANGEID="ManonD"
         >
            <PROP NAME="Origin">MT</PROP>
            <SEG>donn&#xE9;es (avec un caract&#x00E8;re non standard: &#xF8FF;).</SEG>
         </TUV>
      </TU>
      <TU ID="0002">
         <PROP NAME="Domain">Cooking</PROP>
         <TUV LANG="EN">
            <SEG>menu</SEG>
         </TUV>
         <TUV LANG="FR-CA">
            <SEG>menu</SEG>
         </TUV>
         <TUV LANG="FR-FR">
            <SEG>menu</SEG>
         </TUV>
      </TU>
   </BODY>

</TMX>
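
An import routine could read such a file along the following lines. This is a minimal sketch, assuming Python's xml.etree.ElementTree; the embedded document is a shortened, hypothetical variant of the sample above, written so that the parser accepts it (no XML declaration, attributes separated by whitespace only), and the function name translation_units is not part of TMX.

```python
import xml.etree.ElementTree as ET

# Shortened variant of the sample, for illustration only.
SAMPLE = """\
<TMX VERSION="1.0">
   <HEADER CREATIONTOOL="XYZTool v1.01-023" DATATYPE="Text"
           SEGTYPE="sentence" O-TMF="ABCTransMem" SRCLANG="EN"/>
   <BODY>
      <TU ID="0001">
         <TUV LANG="EN"><SEG>data</SEG></TUV>
         <TUV LANG="FR-CA"><SEG>donn&#xE9;es</SEG></TUV>
      </TU>
      <TU ID="0002">
         <TUV LANG="EN"><SEG>menu</SEG></TUV>
         <TUV LANG="FR-CA"><SEG>menu</SEG></TUV>
         <TUV LANG="FR-FR"><SEG>menu</SEG></TUV>
      </TU>
   </BODY>
</TMX>
"""


def translation_units(root):
    """Yield (ID, {LANG: segment text}) for each <TU> in the document."""
    for tu in root.iter("TU"):
        variants = {
            tuv.get("LANG"): tuv.findtext("SEG")
            for tuv in tu.findall("TUV")
        }
        yield tu.get("ID"), variants


units = list(translation_units(ET.fromstring(SAMPLE)))
for tu_id, variants in units:
    print(tu_id, variants)
```

The parser resolves numeric character references itself, so the French segment of the first unit is returned with the actual accented character rather than the reference.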

4. Glossary

DTD
An SGML document has an associated Document Type Definition (DTD) that specifies the rules for the structure of the document. Several industries have standardized on various DTDs for the different types of documents that they share.
SGML
SGML stands for Standard Generalized Markup Language, an ISO standard (ISO 8879) that allows the definition of structured formats. SGML is not a format by itself, but a set of rules to define formats. SGML formats are defined in Document Type Definition files (DTDs).
UCS-2
UCS-2 is a 16-bit fixed-length encoding scheme of the Unicode character set.
XML
XML stands for Extensible Markup Language. XML is a simplified and restricted subset of SGML.
UTC
UTC stands for Coordinated Universal Time.

5. References

The following links are useful references for implementing TMX, SGML and XML-related applications.