[Cache from http://www.opentag.com/otspecs.htm; please use this canonical URL/source if possible.]
OpenTag
Format Specifications
Version 1.2 -
Nov-23-1998 - Last edit: Jan-07-2001
Abstract
OpenTag™ is
a format to encode data (mostly text) extracted from an original
file of any format. Its purpose is to allow the extraction of a
document, processing the text in a standard common format, and
then, if needed, merging the text back into its original format.
OpenTag is XML
compliant. You can find the latest XML specifications at http://www.w3.org/TR/REC-xml.
The terms ELEMENT,
ATTRIBUTE, VALUE, TAG and CONTENT used in this document and its
collaterals are meant in the sense they are used in an XML/SGML
context. Here are some examples:
Element with content:
<ELEM1 ATTR="value">content</ELEM1>
|      |     |      |      |
|      |     |      |      +-----> End tag of the element ELEM1
|      |     |      +------------> Content of the element ELEM1
|      |     +-------------------> Value of the attribute ATTR
|      +-------------------------> Attribute ATTR
+--------------------------------> Start tag of the element ELEM1
Element without content (empty element):
<ELEM2 ATTR="value" />
|      |     |      |
|      |     |      +------------> Closing marker of the empty element ELEM2
|      |     +-------------------> Value of the attribute ATTR
|      +-------------------------> Attribute ATTR
+--------------------------------> Opening marker of the element ELEM2
An OpenTag file can be
encoded either as an ASCII 7-bit file, an 8-bit file or a UTF-16 16-bit
file. The XML encoding declaration must be specified if the file
is in an encoding other than UTF-8 or UTF-16.
Special care should be
taken when processing text in a multi-byte code set. For
paragraphs where the ws attribute allows the text to be wrapped,
additional line-breaks and spaces must not break characters.
When encoded in 16-bit,
the first two bytes of the file must be the Unicode Byte-Order-Mark
character (U+FEFF).
Like any XML document, OpenTag files use
numeric character references (NCRs) to specify the characters that do not exist in the
encoding used. A numeric character reference can be either in hexadecimal or
decimal notation. The hexadecimal notation is &#xHHHH; where HHHH is the
hexadecimal value of the Unicode code point for the given character. The decimal
notation is &#DDDD; where DDDD is the decimal value of the Unicode code
point for the given character.
Example:
<p id="1">Lowercase "a grave" = &#xE0; = &#224;</p>
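As a rough illustration of the two notations, the following Python sketch replaces every character that a chosen encoding cannot represent with a numeric character reference. The function name `to_ncr` and its `hexadecimal` parameter are hypothetical and exist only for this example; they are not part of OpenTag.

```python
def to_ncr(text, encoding="ascii", hexadecimal=True):
    """Replace characters missing from `encoding` with numeric character references."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # Hexadecimal form is &#xHHHH; and decimal form is &#DDDD;
            out.append(f"&#x{ord(ch):X};" if hexadecimal else f"&#{ord(ch)};")
    return "".join(out)


print(to_ncr("à"))                     # &#xE0;
print(to_ncr("à", hexadecimal=False))  # &#224;
```

Python's built-in `xmlcharrefreplace` error handler produces the decimal form directly; the explicit loop is used here only so that both notations are visible.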
Several ASCII
characters also need to be coded with entities to avoid confusion
with OpenTag markers:
- The character < (ASCII 0x3C) should be coded "&lt;"
(or &#x3C; or &#60;).
- The character & (ASCII 0x26) should be coded "&amp;"
(or &#x26; or &#38;).
- The character " (ASCII 0x22)
should be coded "&quot;" (or &#x22; or &#34;) in
attribute values enclosed between double-quotes.
- The character ' (ASCII 0x27)
should be coded "&apos;" (or &#x27; or &#39;) in
attribute values enclosed between single-quotes.
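The escaping rules above could be applied with Python's standard `xml.sax.saxutils.escape` function, as in the sketch below. The helper names `escape_text` and `escape_attr` are hypothetical. Note that `escape` also converts > to an entity, which these rules do not require but which is harmless in XML.

```python
from xml.sax.saxutils import escape


def escape_text(text):
    # escape() handles & first, then < and >, so entities are not escaped twice
    return escape(text)


def escape_attr(value, quote='"'):
    # Only the quote character used as delimiter needs an entity
    extra = {'"': "&quot;"} if quote == '"' else {"'": "&apos;"}
    return quote + escape(value, extra) + quote


print(escape_text("A & B < C"))        # A &amp; B &lt; C
print(escape_attr('say "hi"'))         # "say &quot;hi&quot;"
print(escape_attr("it's", quote="'"))  # 'it&apos;s'
```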
In OpenTag, the
attributes of all structural and informative elements and
delimiter elements can be inherited. If an element does not have
some of its attributes specified, the values for those attributes
are the same as the values of the closest parent element.
For example:
<grp lc="EN" rid="DLG1" id="34">
<grp id="id_23">
<p lc="FR">&amp;Chercher...</p> <!-- inherited: id="id_23" rid="DLG1" -->
<p>&amp;Find...</p> <!-- inherited: lc="EN" id="id_23" rid="DLG1" -->
<p lc="SV">&amp;sök...</p> <!-- inherited: id="id_23" rid="DLG1" -->
</grp>
</grp>
This rule applies for structural
and delimiter elements but does not apply for the in-line
elements.
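One possible way for a tool to compute these inherited values is sketched below in Python with the standard `xml.etree.ElementTree` module. It assumes that every attribute is inheritable and that in-line elements neither inherit nor pass on values; the function name `effective_attributes` and the `INLINE` set are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# In-line elements, which are excluded from inheritance
INLINE = {"ct", "g", "ix", "ixd", "lvl", "ocs", "rf", "so", "tx", "x"}


def effective_attributes(root):
    """Yield (element, attributes) with values taken from the closest ancestors."""
    def walk(elem, inherited):
        if elem.tag in INLINE:
            own, passed = dict(elem.attrib), inherited
        else:
            # The element's own attributes override the inherited ones
            own = passed = {**inherited, **elem.attrib}
        yield elem, own
        for child in elem:
            yield from walk(child, passed)
    yield from walk(root, {})


doc = ET.fromstring(
    '<grp lc="EN" rid="DLG1" id="34">'
    '<grp id="id_23">'
    '<p lc="FR">&amp;Chercher...</p>'
    '<p>&amp;Find...</p>'
    '</grp></grp>'
)
for elem, attrs in effective_attributes(doc):
    if elem.tag == "p":
        print(elem.text, attrs)
```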
XML is a case-sensitive
markup. The names of elements and attributes in OpenTag are always
lowercase.
In case OpenTag markup is mixed with
other content types and you need to use a namespace identifier, the URI for
OpenTag is: urn:OpenTag:Version12.
After the XML
processing instruction comes the OpenTag document itself,
enclosed within the <opentag> element. An OpenTag document is composed
of zero, one or more sections, each enclosed within a <file> element.
The XML prologue is
mandatory. It sets the defaults for the encoding of the file. If
the encoding declaration is omitted, the file is assumed to be
either in UCS-2 or UTF-8. The first character of the file must be
the Unicode Byte-Order-Mark if the file is in UCS-2.
<?xml version="1.0" encoding="iso-8859-1" ?>
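The following Python sketch shows how a reader might apply these defaults before handing the file to a parser: a Byte-Order-Mark signals a 16-bit file, an explicit encoding declaration is honoured otherwise, and UTF-8 is assumed when neither is present. The function name `detect_encoding` is hypothetical, and real XML parsers already perform this detection themselves.

```python
import codecs
import re


def detect_encoding(data: bytes) -> str:
    """Guess the encoding of an OpenTag file from its first bytes."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Byte-Order-Mark present: 16-bit file
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"  # no BOM and no encoding declaration


print(detect_encoding(b'<?xml version="1.0" encoding="iso-8859-1" ?>'))  # iso-8859-1
```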
A minimal
OpenTag document looks something like this:
<?xml version="1.0" ?>
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN" datatype="PlainText" original="file.ext">
<p>Hello World!</p>
</file>
</opentag>
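A tool could check a document of this shape for the mandatory attributes of <opentag> and <file> roughly as follows. This is a Python sketch using the standard `xml.etree.ElementTree` module; the `MANDATORY` table and the function name `missing_attributes` are hypothetical, and only these two elements are covered.

```python
import xml.etree.ElementTree as ET

# Mandatory attributes as listed in this specification for two elements
MANDATORY = {
    "opentag": ("version",),
    "file": ("tool", "datatype", "original", "lc"),
}


def missing_attributes(xml_text):
    """Return (tag, attribute) pairs for mandatory attributes that are absent."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        for name in MANDATORY.get(elem.tag, ()):
            if name not in elem.attrib:
                problems.append((elem.tag, name))
    return problems


sample = """<?xml version="1.0" ?>
<opentag version="1.2">
 <file tool="XYZ v1.0" lc="EN" datatype="PlainText" original="file.ext">
  <p>Hello World!</p>
 </file>
</opentag>"""
print(missing_attributes(sample))  # []
```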
OpenTag elements can be
divided into three main categories: the structural and
informative elements, the in-line elements and the delimiter
elements. Attributes are shared among them.
Structural and informative elements | <csdef>, <file>, <grp>, <map/>, <note>, <opentag>, <p>, and <prop>. |
In-line elements | <ct>, <g>, <ix/>, <ixd>, <lvl>, <ocs>, <rf/>, <so>, <tx>, and <x/>. |
Delimiter elements | <mrk> and <s>. |
Attributes | base, case, cs, code, comp, cond, coord, datatype, date, ent, font, id, lc, name, original, reference, rid, seg, subst, tool, ts, type, ucode, var, version, and ws. |
The structural elements
specify the frame of an OpenTag document as well as contextual
and processing information. The <p> element contains the extracted
data and, possibly, in-line elements.
<opentag> |
OpenTag document - The
<opentag> element encloses all the other elements
of the document. |
Mandatory attributes: |
version. |
Optional attributes: |
None. |
Contents: |
One or more <file> elements. |
<?xml version="1.0"?>
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN"
datatype="PlainText"
original="file.ext">
<p>Hello World!</p>
</file>
</opentag>
|
OpenTag document with the minimal
structure. |
<file> |
File - The <file>
element corresponds to a single extracted original
document. |
Mandatory attributes: |
tool, datatype, original, lc. |
Optional attributes: |
reference,
date, type, ws, ts. |
Contents: |
Zero, one or more <csdef> elements, followed by
zero, one or more of the following elements: <prop>, <note>, <grp>, <p>. |
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN" datatype="JavaText"
original="Test1.java">
...
</file>
<file tool="XYZ v1.0" lc="EN" datatype="rtf"
original="\\brazil\recife\data.rtf">
...
</file>
</opentag>
|
An OpenTag document with two <file>
elements of different data types. |
<prop> |
Property - The <prop>
element allows the tools to specify non-standard
information in the OpenTag document. |
Mandatory attributes: |
type. |
Optional attributes: |
lc, rid. |
Contents: |
Tool-specific data or text, no
standard elements. |
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN"
datatype="JavaText"
original="Test1.java">
<p id="23" type="caption">Input</p>
<p id="24" type="label">File name:</p>
<prop type="WordCount">3</prop>
</file>
</opentag>
|
Here the <prop> element is
used to define a tool-specific property called "WordCount".
You could also use it to specify attached files, project
information, translation memory data, machine translation
processing data, etc.
The tool attribute identifies which tool has generated
the document so each property can be identified even if
two tools use the same property identifiers.
To define tool-specific data at the tag level, you can
use the ts attribute. |
<grp> |
Group - The <grp>
element specifies a set of elements that should be
processed together. For example: all the items of a menu,
several translations of the same paragraph, etc. A list
of preferred values for the type
attribute in <grp> is available.
Note: A <grp> element can
contain other <grp> elements. |
Mandatory attributes: |
None. |
Optional attributes: |
tool, datatype, id, rid, seg, coord, font, type, lc, ws, ts, cond, var. |
Contents: |
Zero, one or more of the
following elements: <p>, <grp>, <note>, <prop>. |
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN"
datatype="JavaText"
original="Test1.java">
<grp rid="DLG_INPUT">
<p id="23" type="caption">Input</p>
<p id="24" type="label">File name:</p>
</grp>
</file>
</opentag>
|
Here the <grp> element is
used to group together several <p> elements belonging to
the same dialog box.
<grp> could also be used to group several language
versions of the same <p> element. |
<p> |
Paragraph - The <p>
element is used to delimit a unit of text. A paragraph in
OpenTag does not necessarily correspond to a "paragraph"
in a word-processor. It's simply a unit of text that
could be a paragraph, a title, a menu item, a caption,
etc. A list of preferred values for the type
attribute in <p> is available. |
Mandatory attributes: |
None. |
Optional attributes: |
tool, datatype, id, rid, seg, coord, font, type, lc, ws, ts, cond, var. |
Contents: |
Text, zero, one or more of the
following elements: <s>, <mrk>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/> and <rf/>. |
<grp id="STR_item23">
<p lc="EN">Monday</p>
<p lc="fr-fr">Lundi</p>
<p lc="TR">Ptesi</p>
<p lc="cs">pondělí</p>
</grp>
|
A set of different translations
of the same <p> element. In this example, the term
"Monday" in English, French, Turkish and Czech. |
<csdef> |
Code set definition - The
<csdef> element specifies user-defined code sets
and characters.
|
Mandatory attributes: |
name, base. |
Optional attributes: |
None |
Contents: |
Zero, one or more <note> elements followed by
one or more <map/> elements. |
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN"
datatype="rtf"
original="c:\proj34\doc\Hobbit.rtf">
<csdef name="Latin1Cirth" base="iso-8859-1">
<map code="130" ucode="" ent="noldorian_o"/>
<map code="S" ucode="57558" ent="noldorian_oo"/>
</csdef>
</file>
</opentag>
|
Here the <csdef> element is
used to declare a user-defined code set called "Latin1Cirth"
which uses the ISO Latin-1 code set as a base (all code-points
not specified in the <csdef> are the same as ISO
8859-1). |
<map/> |
Character mapping - The
<map/> element specifies the correspondence between
a Unicode value and a code-point of a native code set. |
Mandatory attributes: |
code, ucode. |
Optional attributes: |
ent, comp, case, subst. |
Contents: |
Empty. |
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN"
datatype="rtf"
original="c:\proj34\doc\Hobbit.rtf">
<csdef name="Latin1Cirth"
base="iso-8859-1">
<map code="130" ucode="" ent="noldorian_o"/>
<map code="S" ucode="57558" ent="noldorian_oo"/>
</csdef>
</file>
</opentag>
|
Here the <map/> elements
define two user-defined characters. You must use Unicode
values that are within the range of the Private Use Area
(from U+E000 to U+F8FF). See the Unicode Standard 2.0
book, section 6.2 at page 6-119, for more information on
the Private Use Area. |
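A check that a ucode value falls inside the Private Use Area could look like the Python sketch below. It works on the raw attribute text, before an XML parser has resolved any character reference, and accepts either the decimal form or a numeric character reference. The names `parse_code_point` and `is_private_use` are hypothetical.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def parse_code_point(value):
    """Accept a decimal string ("57558") or a character reference ("&#xE0D6;")."""
    if value.startswith("&#x") and value.endswith(";"):
        return int(value[3:-1], 16)
    if value.startswith("&#") and value.endswith(";"):
        return int(value[2:-1])
    return int(value)


def is_private_use(value):
    return PUA_FIRST <= parse_code_point(value) <= PUA_LAST


print(is_private_use("57558"))     # True (57558 is 0xE0D6)
print(is_private_use("&#xE0D6;"))  # True
print(is_private_use("224"))       # False (U+00E0 is a standard character)
```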
<note> |
Note - The <note>
element is used to add document-related comments to the
OpenTag document. XML comments ("<!-- ... -->")
are allowed but are not necessarily kept by processing
tools. |
Mandatory attributes: |
None. |
Optional attributes: |
lc, rid. |
Contents: |
Text, no standard elements. |
<grp id="4567">
<note>This paragraph must always be in uppercase</note>
<p lc="EN"><g id="1">WARNING:</g> YOU MUST
SETUP YOUR WORKING DIRECTORY BEFORE RUNNING
THE CONFIG tool.</p>
</grp>
|
The <note> element can be
used to document the extracted text, to provide
information between the different users that deal with
the file, etc.
Tools must keep <note> elements when they process
an OpenTag document. You can link a note to other
elements with the rid attribute. |
The in-line elements
are the elements that can appear inside the core structural
element <p>.
<g> |
Generic group place-holder
- The <g> element is used to replace any in-line
code of the original document that has a beginning and an
end and can be moved within its parent structural element.
When possible, the type attribute allows you to specify what kind
of attribute the place-holder represents. A list of preferred
values for the type attribute in <g> is available.
Note: A <g> element can contain
another <g> element. In this case, if the embedded
group has an id attribute, it should never be
moved outside of its parent group. |
Mandatory attributes: |
None. But a <g> element
should at least have an id or type attribute to make sense. |
Optional attributes: |
id, type, rid, ts. |
Contents: |
Text. Zero, one or more of the
following elements: <s>, <mrk>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/>, <rf/>. |
<p>Text with some <g id="1"><g id="2">formatting</g> and some other.</g></p>
|
|
<x/> |
Generic place-holder - The
<x/> element is used to replace any code of the
original document. |
Mandatory attributes: |
None. But a <x/> element
should at least have an id or type attribute to make sense. |
Optional attributes: |
id, type, rid, ts. |
Contents: |
Empty. |
<p><x id="1"/>Text with generic code place-holder.</p>
|
|
<ix/> |
Index marker - The <ix/>
element specifies a reference to an index entry. The
definition of the entry itself is done in the
corresponding <ixd> element (both are
linked by their rid attribute, for which they have
the same value). |
Mandatory attributes: |
rid. |
Optional attributes: |
id, ts. |
Contents: |
Empty |
<p>Term<ix rid="INDEX2"/> to index.</p>
|
|
<ixd> |
Index definition - The
<ixd> element is used to specify the entry
corresponding to one or more <ix/> elements. It does not
have to be in the same <p> or even the same <grp> element. Markers and
definitions do have to be in the same <file> element. |
Mandatory attributes: |
rid. |
Optional attributes: |
id, ts. |
Contents: |
One or more <lvl> elements. |
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx></lvl></ixd></p>
|
The <ixd> element used to
define a simple index entry. Here it defines the text for
all <ix/> markers that also have
the rid attribute set to "INDEX2". |
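The link between markers and definitions can be resolved within one <file> element along the following lines. This Python sketch uses the standard `xml.etree.ElementTree` module; the function name `index_entries` and the shape of its return value are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def index_entries(file_elem):
    """Map each rid to (number of <ix/> markers, list of level texts)."""
    definitions = {}
    for ixd in file_elem.iter("ixd"):
        levels = ["".join(tx.itertext()) for tx in ixd.iter("tx")]
        definitions[ixd.get("rid")] = levels
    counts = {}
    for ix in file_elem.iter("ix"):
        counts[ix.get("rid")] = counts.get(ix.get("rid"), 0) + 1
    return {rid: (counts.get(rid, 0), levels) for rid, levels in definitions.items()}


doc = ET.fromstring(
    '<file tool="XYZ v1.0" lc="EN" datatype="PlainText" original="file.ext">'
    '<p>Term<ix rid="INDEX2"/> to index.</p>'
    '<p><ixd rid="INDEX2" id="34"><lvl><tx>$ command</tx></lvl></ixd></p>'
    '</file>'
)
print(index_entries(doc))  # {'INDEX2': (1, ['$ command'])}
```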
<tx> |
Index entry text - The <tx>
element is used to delimit the text of an index entry
level. |
Mandatory attributes: |
None. |
Optional attributes: |
id, seg, ts. |
Contents: |
Text. Zero, one or more of the
following elements: <s>, <mrk>, <g>, <ocs>, <ct>, <x/>, <rf/>, and zero or one <so> element. |
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx></lvl></ixd></p>
|
The <tx> element used to
define a simple index entry. |
<lvl> |
Level - The <lvl>
element is used to delimit the different levels of an
index entry. |
Mandatory attributes: |
None. |
Optional attributes: |
id, ts. |
Contents: |
One <tx> element and zero
or one <so> element. |
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx></lvl></ixd></p>
|
The <lvl> element used to
define a simple index entry. |
<so> |
Sort order - The <so>
element indicates the text that should be used to sort an
index entry in an <lvl> element. |
Mandatory attributes: |
None. |
Optional attributes: |
id, seg. |
Contents: |
Text, no elements. |
<p><ixd rid="INDEX2" id="34">
<lvl><tx>$ command</tx><so>dollar command</so>
</lvl></ixd></p>
|
The <so> element used to
specify the "reading order" of an entry to sort
the symbol according to its pronunciation. |
<rf/> |
Reference marker - The
<rf/> element specifies a reference to any type of
reference text (variable, pre-composed text, footnote,
etc.). The definition of the reference text itself is
done in one or more corresponding <p> elements (linked by
their rid attribute, for which they have
the same value). |
Mandatory attributes: |
rid. |
Optional attributes: |
id, ts, type. |
Contents: |
Empty. |
<p id="1" rid="1" type="fn">Elephant: Big animal.</p>
<p id="2">The happy elephant<rf rid="1" type="fn"/>.</p>
<grp rid="2">
<p type="alt" id="3">Click here to go to Description</p>
<p type="link" id="4">http://www.xyz.com/desc.htm</p>
</grp>
<p id="5">See <g id="1" rid="2">Description</g>.</p>
|
The <rf/> element can be
used to reference anything. It simply marks the position
where the text should go. The link between reference
definition and marker is done with the rid attribute.
For example here the first paragraph is a definition of a
footnote that is located in the second paragraph. Note
that in some cases the reference can be composed of
several <p> elements, as for the <g> element of the fifth
paragraph. |
<ocs> |
Original code set - The
<ocs> element is used to indicate the code set of a
part of the text that is different from the default code
set. Note that <ocs> is only informative; in the
OpenTag file the text within an <ocs> element is in
the same code set as the surrounding text. |
Mandatory attributes: |
cs. |
Optional attributes: |
id, ts. |
Contents: |
Text, zero, one or more of the
following elements: <s>, <mrk>, <g>, <ixd>, <ct>, <x/>, <ix/>, <rf/>. |
<p><ocs cs="Symbol">✔</ocs> First
item of the list</p>
|
Here the <ocs> element
allows you to specify that the first character of the
paragraph is a check mark symbol and should be coded in a
code set different from the rest of the text when merged
back.
Remember that <ocs> does not specify a change of
code set in the OpenTag file itself. |
<ct> |
Conditional text - The
<ct> element is used to mark specific strings of
the text for a given condition. |
Mandatory attributes: |
cond. |
Optional attributes: |
id, ts. |
Contents: |
Text. Zero, one or more of the
following elements: <s>, <mrk>, <g>, <ixd>, <ocs>, <x/>, <ix/>, <rf/>. |
<p id="2">See <ct cond="doc">page 34</ct><ct
cond="hlp">screen 7</ct>
for more information.</p>
|
The <ct> element used to
mark two different texts corresponding to two different
outputs. |
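To produce one of the outputs, a tool might keep only the <ct> spans whose condition is active. The Python sketch below, based on the standard `xml.etree.ElementTree` module, illustrates the idea; the function name `render` and the `active` set of condition names are hypothetical.

```python
import xml.etree.ElementTree as ET


def render(elem, active):
    """Return the text of elem, dropping <ct> spans whose cond is not in `active`."""
    parts = []
    if elem.tag != "ct" or elem.get("cond") in active:
        parts.append(elem.text or "")
        for child in elem:
            parts.append(render(child, active))
    # Text after the element's end tag belongs to the parent and is always kept
    parts.append(elem.tail or "")
    return "".join(parts)


p = ET.fromstring(
    '<p id="2">See <ct cond="doc">page 34</ct>'
    '<ct cond="hlp">screen 7</ct> for more information.</p>'
)
print(render(p, {"doc"}))  # See page 34 for more information.
print(render(p, {"hlp"}))  # See screen 7 for more information.
```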
OpenTag defines
additional elements to support various types of text processing.
These elements are usually not generated by the extraction module
and are ignored most of the time during merging, but they can be
very powerful with tools such as Machine Translation, glossary
handling, quality assurance, etc.
<s> |
Segment - The <s>
element indicates a unit of text such as a sentence,
title, menu item, message, etc. The <s> element is
not part of the tags used to merge the OpenTag file back
into its original format. A list of preferred
values for the type attribute in <s> is available. |
Mandatory attributes: |
None. |
Optional attributes: |
seg, ts, id. |
Contents: |
Text. Zero, one or more of the
following elements: <mrk>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/>, <rf/>. |
<p id="1"><s seg="1">Click OK. </s><s
seg="2">Save the file.</s></p>
|
The <s> element separates
segments within a paragraph. When the <p> element contains only a
single segment you can avoid using <s> and simply
use the seg attribute. |
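Both ways of marking segments can be read with the same routine, as this Python sketch suggests. The function name `segments` is hypothetical, and the code assumes that a paragraph uses either <s> elements or its own seg attribute, not both.

```python
import xml.etree.ElementTree as ET


def segments(p):
    """Return (seg, text) pairs for a <p> element."""
    found = [(s.get("seg"), "".join(s.itertext())) for s in p.iter("s")]
    if not found and p.get("seg") is not None:
        # Single-segment paragraph: the seg value is carried by <p> itself
        found = [(p.get("seg"), "".join(p.itertext()))]
    return found


p1 = ET.fromstring('<p id="1"><s seg="1">Click OK. </s><s seg="2">Save the file.</s></p>')
p2 = ET.fromstring('<p id="3" seg="4">Single segment in a paragraph.</p>')
print(segments(p1))  # [('1', 'Click OK. '), ('2', 'Save the file.')]
print(segments(p2))  # [('4', 'Single segment in a paragraph.')]
```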
<mrk> |
Marker - The <mrk>
element delimits a section of text that has special
meaning, such as a terminological unit, a proper name, an
item that should not be modified, etc. It can be used for
various processing tasks: for example, to indicate to a
Machine Translation tool which proper names should not be
translated, to verify terminology, or to mark suspect
expressions after grammar checking. The <mrk>
element is usually not generated by the extraction tool
and it is not part of the tags used to merge the OpenTag
file back into its original format. A list of preferred values for the type
attribute in <mrk> is available. |
Mandatory attributes: |
type. |
Optional attributes: |
id, ts. |
Contents: |
Text. Zero, one or more of the
following elements: <s>, <g>, <ixd>, <ocs>, <ct>, <x/>, <ix/>, <rf/>. |
<p lc="EN-US">Use a <mrk type="term">regular expression</mrk>
to search for <mrk type="name">Hobbit</mrk> item marker.
</p>
|
In this example the <mrk>
element is used to tag a glossary term as well as a
proper name. |
This section lists the
various attributes used in the OpenTag elements. An attribute is
never specified more than once for each element.
version |
OpenTag version - The
version attribute is used to specify the format version
of the OpenTag document. |
Value description: |
A number. |
Default value: |
Empty string. |
Element using it: |
<opentag>. |
<opentag version="1.2">
<file tool="XYZ v1.0" lc="EN"
datatype="rtf"
original="c:\simaril6\gandalf.rtf">
...
</file>
</opentag>
|
This example shows an OpenTag
document corresponding to the specifications of version 1.2. |
tool |
Creation tool - The tool
attribute is used to specify the signature and version of
the tool that created or modified the document. |
Value description: |
Not defined by the standard. |
Default value: |
Empty string. |
Elements using it: |
<file>, <grp>, <p>. |
<file tool="XYZ v1.0" lc="EN"
datatype="rtf"
original="c:\simaril6\gandalf.rtf">
...
</file>
|
Here the creation tool is
identified as "XYZ v1.0". Usually you want your
tool signature to indicate the version as well as the
tool.
The tool attribute allows you to know how you should
process tool-specific data such as <prop> elements and ts attributes. |
datatype |
Data type - The datatype
attribute specifies the kind of text contained in the
element. Depending on that type, you may apply different
processes to the data. |
Value description: |
Not defined by the standard.
However, a list of recommended values is
provided. |
Default value: |
Empty string. |
Elements using it: |
<file>, <grp>, <p>. |
<file lc="EN-US"
tool="LXString 1.01-004"
datatype="JavaString"
original="//brazil/adm/tmp/app.pro">
...
</file>
|
The datatype attribute here
specifies that the text in the file has been extracted
from a Java property or source code file. |
date |
Date - The date attribute
indicates when a given element was created or modified. |
Value description: |
CCYY-MM-DDThh:mm:ss (for local
time) or CCYY-MM-DDThh:mm:ssZ (for UTC time). |
Default value: |
Empty string. |
Element using it: |
<file>. |
<file lc="EN-US"
tool="Java_OTF 1.01-004"
datatype="JavaString"
original="//brazil/adm/tmp/app.pro"
date="1997-11-25T06:12:00">
...
</file>
|
The date attribute specifies 25
November 1997 at 06:12:00 local time. |
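Both forms of the date value can be read with Python's standard `datetime` module, as in the sketch below. The function name `parse_opentag_date` is hypothetical. A value without the trailing Z is returned as a naive datetime, since the attribute itself does not say which local time zone applies.

```python
from datetime import datetime, timezone


def parse_opentag_date(value):
    """Parse CCYY-MM-DDThh:mm:ss (local) or CCYY-MM-DDThh:mm:ssZ (UTC)."""
    if value.endswith("Z"):
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return parsed.replace(tzinfo=timezone.utc)
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")  # naive: local time


print(parse_opentag_date("1997-11-25T06:12:00"))   # 1997-11-25 06:12:00
print(parse_opentag_date("1997-11-25T06:12:00Z"))  # 1997-11-25 06:12:00+00:00
```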
lc |
Locale - The lc attribute
specifies the locale of the text of a given element.
|
Value description: |
A 2-letter code corresponding to
one of the language identifiers defined in ISO-639, or a 2+2-letter code
where the first 2 letters are one of the language
identifiers defined in ISO-639 followed by a dash and
one of the country/region identifiers defined in ISO-3166.
Note: The reserved xml:lang attribute defined in XML does
not correspond to OpenTag's definition of a locale, and
its scope rules are not appropriate for attributes in the
OpenTag case. Therefore OpenTag does not use it to
indicate locale/language. However the lc attribute uses
values that are very similar to the values used for xml:lang. |
Default value: |
Empty string. |
Elements using it: |
<file>, <grp>, <p>, <note>, <prop>. |
<grp id="id_SEARCH">
<p lc="EN-US">&amp;Search...</p>
<p lc="FR-FR">&amp;Recherche...</p>
</grp>
|
An OpenTag document can contain
multi-lingual data: The lc attribute is used to tag each
specific locale. |
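The shape of an lc value (two letters, optionally followed by a dash and two more letters) can be checked with a regular expression, as in this Python sketch. It validates the form only and does not look the codes up in the ISO lists; it also accepts both upper and lower case, because the examples in this document use both. The names `LC_PATTERN` and `is_valid_lc` are hypothetical.

```python
import re

# Two letters, optionally followed by a dash and two more letters
LC_PATTERN = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")


def is_valid_lc(value):
    return LC_PATTERN.fullmatch(value) is not None


for candidate in ("EN", "fr-fr", "EN-US", "english", "EN_US"):
    print(candidate, is_valid_lc(candidate))
```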
cs |
Code set - The cs
attribute specifies the code set of the text for a given
element. When the encoding of the file is UCS-2 or ISO-646
the cs attribute is only informative.
|
Value description: |
One of the code set identifiers
defined by the IANA, or a user-defined code
set name declared in a <csdef> element. A sub-set
of the preferred values is available in this document. |
Default value: |
Empty string. |
Elements using it: |
<ocs>. |
<p lc="EN">Text in English</p>
<p lc="cs"><ocs cs="iso-8859-2">Text in Czech</ocs></p>
|
The text within an <ocs> element is in the same
code set as the rest of the file, but the cs attribute
indicates what was the original code set in the source
document. |
name |
Name - The name attribute
specifies the user-defined code set name of a <csdef> element. |
Value description: |
Not specified by the standard. |
Default value: |
Empty string. |
Elements using it: |
<csdef>. |
<csdef name="Latin1Cirth" base="iso-8859-1">
<map code="130" ucode="" ent="noldorian_o"/>
</csdef>
|
This example shows how the name
attribute is used to identify a <csdef> element.
The name value can contain any characters; however, white
space characters are not recommended. |
type |
Type - The type attribute
specifies the context and the type of resource or style
of the data of a given element. For example, to define if
it is a label, or a menu item in the case of resource-type
data, or the style in the case of document-related data. |
Value description: |
The value will depend on each
element. A recommended list of values is provided by the
standard. |
Default value: |
Empty string. |
Elements using it: |
<prop>, <file>, <grp>, <p>, <g>, <rf/>, <x/>. |
<p type="message">Cannot find %s.</p>
<p type="label">List:</p>
|
The type attribute used to give
context information with a paragraph. |
id |
Identifier - The id
attribute is used in many elements, usually as a unique
reference to the original corresponding format for the
given element. |
Value description: |
Alpha-numeric. It is recommended
to not use spaces. |
Default value: |
Empty string. |
Elements using it: |
<grp>, <p>, <g>, <x/>, <ix/>, <ixd>, <lvl>, <so>, <rf/>, <ocs>, <ct>, <s>, <mrk>. |
<p id="34">Extracted text</p>
<p id="IDC_file_OPEN">&amp;Open...</p>
|
The id attribute can be extracted
from the original file, or generated automatically. |
rid |
Reference identifier - The
rid attribute is used to link different elements that are
related. For example, a reference to its definition, or
paragraphs belonging to the same group, etc. |
Value description: |
Alpha-numeric. It is recommended
to not use spaces. |
Default value: |
Empty string. |
Elements using it: |
<grp>, <p>, <note>, <ix/>, <ixd>, <g>, <rf/>, <x/>, <prop>. |
<p id="23">Start <rf rid="1"/>.</p>
<p id="24" rid="1">YZApplication</p>
|
In this example the attribute rid
links a reference marker with its definition later in the
file. |
cond |
Condition - The cond
attribute is used to identify an element corresponding to
conditional text in the original format.
You can use the <ct> element to set a
condition for a sub-set of text. |
Value description: |
Alpha-numeric. |
Default value: |
Empty string. |
Elements using it: |
<grp>, <p>, <ct>. |
<p id="12" cond="Common">Text for
<ct cond="DocOnly">the documentation</ct>
<ct cond="HlpOnly">the On-line help</ct>
only.</p>
|
This paragraph has some common
text and two variations; one for documentation, the other
for on-line help. |
seg |
Segment identifier - The
seg attribute is used to mark an element as a segment or
specific translation unit. |
Value description: |
Alpha-numeric. It is recommended
to not use spaces. |
Default value: |
Empty string. |
Elements using it: |
<grp>, <p>, <lvl>, <so>, <s>. |
<p id="3" seg="4">Single segment in a
paragraph.</p>
|
The seg attribute can be used
directly in a paragraph if the paragraph contains only a
single segment. You can also mark segments this way in
each level of an index definition. |
ts |
Tool-specific data - The
ts attribute allows you to include short data understood
by a specific toolset.
You can also use the <prop> element to define large
properties at the element level. |
Value description: |
Not defined by the standard. |
Default value: |
Empty string. |
Elements using it: |
<file>, <grp>, <p>, <ix/>, <lvl>, <rf/>, <ocs>, <ct>, <s>, <mrk>, <x/>, <g>. |
<grp seg="9" >
<p lc="EN-EN">XYZ printer Setup Dialog</p>
<p lc="FR-FR" ts="98%,hobbit.tm"
>Installation de l'imprimante XYZ</p>
</grp>
|
Here the ts attribute is used to
specify the origin of a leveraged translation. |
coord |
Coordinates - The coord
attribute specifies the x, y, cx and cy coordinates of
the text for a given <p> or <grp> element. The cx and cy
values must represent the width and the height (like in a
Windows resource file). The extraction and merging tools
must make the right corrections for the original format
that uses a top-left/bottom-right coordinate system. |
Value description: |
Four decimal (possibly negative)
values, in the order: x,y,cx and cy, separated by semi-colons. |
Default value: |
Empty string. |
Elements using it: |
<grp>, <p>. |
<grp type="button" coord="8;8;50;14;">
<p lc="EN">&amp;Help...</p>
<p lc="IT">&amp;Aiuto...</p>
</grp>
|
|
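A coord value can be split into its four numbers and, where the original format needs it, converted to corner coordinates. The Python sketch below shows one way to do this; the `Coord` type and the function names `parse_coord` and `to_corners` are hypothetical, and the trailing semi-colon seen in the examples is tolerated as an assumption.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    x: int
    y: int
    cx: int  # width
    cy: int  # height


def parse_coord(value):
    """Parse "x;y;cx;cy"; a trailing semi-colon is tolerated."""
    parts = [int(p) for p in value.split(";") if p.strip()]
    if len(parts) != 4:
        raise ValueError(f"expected four values, got {value!r}")
    return Coord(*parts)


def to_corners(c):
    """Convert to (left, top, right, bottom) for formats using corner coordinates."""
    return (c.x, c.y, c.x + c.cx, c.y + c.cy)


c = parse_coord("8;8;50;14;")
print(c)              # Coord(x=8, y=8, cx=50, cy=14)
print(to_corners(c))  # (8, 8, 58, 22)
```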
font |
Font - The font attribute
specifies the font name and font size of the text for a
given <p> or <grp> element. The font
attribute would generally be used for resource-type data:
change of font in document-type data can be marked with
the <g> element. |
Value description: |
Name of the font and its size
separated by a semi-colon. |
Default value: |
Empty string. |
Elements using it: |
<grp>, <p>. |
<grp type="dialog" coord="0;0;100;150;"
font="MS Sans Serif;8">
<p type="caption">Settings</p>
<p type="button">OK</p>
<p type="button">Cancel</p>
</grp>
|
Font attribute in a file
extracted from Windows resources. The font information
could be used by resizing tools, to verify maximum length
of a translation, etc. |
ws |
White spaces - The ws
attribute specifies how white spaces (ASCII spaces, tabs
and line-breaks) should be treated. |
Value description: |
Its value must be:
- 0 if any consecutive white spaces are reduced to one
space ( ).
- 1 if all white spaces must be preserved (i.e. like in a
<PRE> element in HTML format).
- 2 like case 0, but excluding tab from white-spaces. |
Default value: |
0 |
Elements using it: |
<file>, <grp>, <p>. |
<p ws="0">Text
with 4 spaces </p>
<p ws="1">Text with 4 spaces </p>
|
The white spaces in the first
paragraph will be reduced to one space each. In the
second paragraph each white space should be preserved by
the tools. In this case the end result will be the same
for both paragraphs. |
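The three ws values might be applied as in the following Python sketch. The function name `apply_ws` is hypothetical, and value 2 is read here as "collapse runs of spaces and line-breaks but leave tabs untouched", which is one interpretation of the description above.

```python
import re


def apply_ws(text, ws="0"):
    """Normalize white space according to the ws attribute value."""
    if ws == "1":
        return text                              # preserve everything
    if ws == "2":
        return re.sub(r"[ \r\n]+", " ", text)    # tabs are not white space here
    return re.sub(r"[ \t\r\n]+", " ", text)      # ws="0": collapse every run


sample = "Text  with\n\tspaces"
print(repr(apply_ws(sample)))       # 'Text with spaces'
print(repr(apply_ws(sample, "1")))  # 'Text  with\n\tspaces'
print(repr(apply_ws(sample, "2")))  # 'Text with \tspaces'
```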
original |
Original file - The
original attribute specifies the name of the original
file from which the contents of a <file> element have been
extracted. |
Value description: |
Alpha-numeric. |
Default value: |
Empty string. |
Element using it: |
<file>. |
<file lc="EN-US"
tool="Java_OTF 1.01-004"
datatype="JavaString"
original="//brazil/adm/tmp/app.pro"
date="1997-11-25T06:12:00">
...
</file>
|
The original attribute could be
used by tools to locate the various files needed when
merging back the OpenTag document. |
reference |
Reference
file - The reference attribute specifies the name of the reference
file that should be used to merge back the content of a <file> element into its original format. |
Value description: |
Alpha-numeric. |
Default value: |
Empty string. |
Element using it: |
<file>. |
<file lc="EN-US"
tool="Java_OTF 1.01-004"
datatype="JavaString"
original="//brazil/adm/tmp/app.pro"
reference="//brazil/adm/tmp/app.skl"
date="1997-11-25T06:12:00">
...
</file>
|
The reference attribute could be
used by tools to locate the various files needed when
merging back the OpenTag document in case they are not the same as the
original file. The reference file is often called the
"skeleton" file. |
base |
Base code set - The base
attribute specifies the code set upon which the re-mapping
of characters defined by a given <csdef> element is based. |
Value description: |
One of the code set identifiers
defined by IANA. |
Default value: |
Empty string. |
Element using it: |
<csdef>. |
<csdef name="Latin1Cirth" base="iso-8859-1">
<map code="130" ucode="" ent="noldorian_o"/>
</csdef>
|
In this example, the attribute
base indicates that the code set upon which the user-defined
characters are specified is ISO Latin-1. |
code |
Native code - The code
attribute specifies the code-point of the given <map/> element in a native non-Unicode
code set. |
Value description: |
Its value must be either in
character reference format or in text. In the latter case,
the text must be the decimal value of the code-point. |
Default value: |
Empty string. |
Element using it: |
<map/>. |
<map code="131"
ucode=""
ent="noldorian_oo"/>
|
|
ucode |
Unicode code - The ucode
attribute specifies the Unicode code-point of a given <map/> element. |
Value description: |
Its value must be a valid Unicode
code-point value, either in character reference format or
in text. In the latter case, the text must be the decimal
value of the code-point. |
Default value: |
Empty string. |
Element using it: |
<map/>. |
<map code="131"
ucode="57558"
ent="noldorian_oo"/>
|
|
ent |
Entity name - The ent
attribute specifies the name of the character for a given
<map/> element. |
Value description: |
Its value must be a valid ASCII
name (e.g. "amp" for '&'). |
Default value: |
Empty string. |
Element using it: |
<map/>. |
<map code="131"
ucode="57558"
ent="noldorian_oo"/>
|
|
subst |
Substitution text - The
subst attribute specifies the text to substitute for a
character of a given <map/> element, when it does
not exist in the target code set. |
Value description: |
Its value must be a valid ASCII
character or string (e.g. "(c)" for '©'). |
Default value: |
Empty string. |
Element using it: |
<map/>. |
<map code="131"
ucode="57558"
ent="noldorian_oo"
subst="oo"/>
|
|
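As an illustration of how a tool might apply subst values when converting text to a target code set, here is a minimal Python sketch. The function name encode_with_subst, the dictionary of substitutions, and the "?" fallback for characters that have no subst entry are assumptions made for this example and are not defined by OpenTag. It also assumes a stateless target encoding, because it encodes one character at a time.

```python
def encode_with_subst(text, target_codec, subst_by_char):
    """Encode text, replacing characters that the target code set lacks."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target_codec)
        except UnicodeEncodeError:
            # Assumption: characters without a subst entry become "?".
            out += subst_by_char.get(ch, "?").encode(target_codec)
    return bytes(out)


# Hypothetical table built from the subst attributes of <map/> elements.
subst_by_char = {"\u00a9": "(c)"}

print(encode_with_subst("\u00a9 1998", "ascii", subst_by_char))
```

Characters that exist in the target code set pass through unchanged, so the same table can be reused for several targets.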
comp |
Composition codes - The
comp attribute specifies the possible base Unicode
characters used to compose the character of a given <map/> element. |
Value description: |
Its value must be a list of two,
three, four or five Unicode values (including user-defined
characters). Each value is separated by a semi-colon and
should be either in character reference format or in text.
In the latter case, the text must be the decimal value of
the Unicode code-point. |
Default value: |
Empty string. |
Element using it: |
<map/>. |
<map code="167"
ucode="57000"
ent="acircacutetone"
comp="ấ"/>
|
For example, in Vietnamese the
letter a circumflex can have an additional acute tone
mark. Some fonts may need to have a direct mapping to
this combination. |
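The semicolon-separated list form of comp can be read as shown in this Python sketch. The value "226;769" (a circumflex followed by a combining acute accent) is a hypothetical value chosen for the Vietnamese case described above, not one taken from the specification. The function parse_comp is likewise an assumption and handles only the decimal text form.

```python
import unicodedata


def parse_comp(value):
    """Split a comp value into a list of integer code-points."""
    points = [int(part) for part in value.split(";") if part.strip()]
    if not 2 <= len(points) <= 5:
        raise ValueError("comp must list two to five values")
    return points


# Hypothetical comp value: U+00E2 (a circumflex) + U+0301 (combining acute).
points = parse_comp("226;769")
composed = unicodedata.normalize("NFC", "".join(chr(p) for p in points))
print(points, hex(ord(composed)))
```

Here Unicode normalization happens to yield a single precomposed character. For user-defined characters no such composition exists, which is the situation the comp attribute is meant to describe.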
case |
Case - The case attribute
specifies the opposite-case character for the one given in a <map/> element (e.g. 'A' is
the case change for 'a'). |
Value description: |
Its value must be a valid Unicode
code-point value (including user-defined characters),
either character reference format or text format. In the
latter case, the text must be the decimal value of the
code-point. |
Default value: |
Empty string. |
Element using it: |
<map/>. |
<csdef name="Deseret" base="iso-8859-1">
<map code="63" ucode="" ent="deseret_BEE" case=""/>
<map code="111" ucode="" ent="deseret_bee" case=""/>
</csdef>
|
|
var |
Variant - The var
attribute allows you to identify different elements
according to the way they have been generated. |
Value description: |
Its value is not specified by the
standard. |
Default value: |
Empty string. |
Elements using it: |
<grp>, <p>, <s>. |
<grp id="45">
<p lc="EN-US">&amp;Search...</p>
<p lc="FR-FR">&amp;Chercher...</p>
<p lc="FR-FR" var="MT">&amp;Recherche...</p>
</grp>
|
In this example, the var
attribute is used to specify an additional proposition
for the translation, here coming from a Machine
Translation product. It could be used for marking other
automatic translations (TM, glossary leveraging, etc.),
verified text, even keeping a history list. |
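A tool reading such a group might separate the proposals by their var value. The Python sketch below is one possible approach: the function name proposals is hypothetical, and the sample assumes the ampersands are escaped as &amp;amp; so that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

grp = ET.fromstring(
    '<grp id="45">'
    '<p lc="EN-US">&amp;Search...</p>'
    '<p lc="FR-FR">&amp;Chercher...</p>'
    '<p lc="FR-FR" var="MT">&amp;Recherche...</p>'
    '</grp>'
)


def proposals(grp_elem, lc):
    """Map each var value to the texts of the <p> elements for one locale."""
    result = {}
    for p in grp_elem.findall("p"):
        if p.get("lc") == lc:
            # The default value of var is the empty string.
            result.setdefault(p.get("var", ""), []).append(p.text or "")
    return result


print(proposals(grp, "FR-FR"))
```

The entry keyed by the empty string holds the regular translation, and the "MT" entry holds the machine-translation proposal.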
This section lists the
recommended values for some of the attributes. Values for these
attributes are not case sensitive. These lists are purely
informative; the goal is to specify a preferred syntax so that tools
can have some level of compatibility.
Values
for the type attribute of the <grp>, <p>, <s> and <rf/> elements. This list is not
exhaustive.
- shortcut (Windows
accelerators, shortcuts in resource or property files)
- button (button
in UI)
- caption (title
in UI, caption in documentation, alternate text, etc.)
- checkbox (check
box in UI)
- cell (text
in a table cell)
- dialog (dialog
box in UI)
- file (filename,
path)
- footer (footer
text)
- font (font
name)
- frame (frame
or window, or any generic group of components)
- header (header
text)
- heading (title
or header-type segment)
- keywords (list
of keywords, enumeration within a paragraph, etc.)
- label (static
text, label in UI, etc.)
- listitem (paragraph
in a list, entry in a list box, etc.)
- menu (menu)
- menuitem (entry
in a UI menu)
- message (prompt,
error or warning message)
- radio (radio
button in UI)
- string (generic
text from source code, string table, etc.)
- var (variable)
- fn (footnote)
Values for the type attribute of the <mrk> element. This list is not
exhaustive.
- abbrev (abbreviation,
acronym, etc.)
- datetime (date
or time information)
- name (proper
or common name)
- phrase (sub-sentence
level)
- protected (text
that should remain untouched during the process)
- term (one
or more words of a terminology entry)
Values for the type attribute of the <g> element. This list is not
exhaustive.
- bold (bold
or strong text)
- font (text
with font size, font face, color changes, etc.)
- italic (italicized
text)
- link (hypertext)
- underlined
(underlined text)
Values for the type attribute of the <x/> element. This list is not
exhaustive.
- pb (paragraph
break)
- lb (forced
line break)
Values for the cs and base attributes. This list is not exhaustive,
but gives you examples from which you can guess additional names.
- iso-8859-1
(ISO 8859-1, Latin 1, ANSI)
- iso-8859-2
(ISO 8859-2, Latin 2, Slavic)
- iso-8859-4
(ISO 8859-4, Latin 4, Baltic)
- windows-1252
(Windows code-page 1252, a.k.a. "Windows ANSI")
- windows-1256
(Windows code-page 1256, Arabic)
- windows-874
(Windows code-page 874, Thai)
- shift_jis (Japanese
Shift-JIS, code-page 932)
- big5 (Traditional
Chinese, code-page 950)
- gb2312 (Simplified
Chinese GBCode, code-page 936)
- dos-862 (DOS
code-page 862 Hebrew)
- koi8-r (Cyrillic
KOI8-R code set)
- utf-8 (UTF-8
Unicode)
Etc... See the IANA documentation for more
information.
Values for the datatype attribute. This list is not
exhaustive.
- CDF (Channel
Definition Format)
- CPP (C and
C++ style text)
- JavaScript
(JavaScript, ECMAScript scripts)
- HTML (HTML,
DHTML, etc.)
- Interleaf (Interleaf
document)
- Java (Java,
source and property files)
- Lisp (Lisp)
- MIF (FrameMaker
MIF, MML, etc.)
- Pascal (Pascal,
Delphi style text)
- RTF (Rich
Text Format)
- SGML (Generic
SGML)
- PlainText (Plain
text)
- VBScript (Visual
Basic scripts)
- WinRes (Windows
resources, RC, DLL, EXE)
This section shows a
short sample of an OTF file.
Notation conventions:
- The indentations
are only to illustrate the hierarchy of the elements,
they are not required.
- The BOLD
elements and attributes are mandatory.
- ITALICS
indicates elements and attributes that can be specified
zero or one time.
- NORMAL typeface is
used for the elements and attributes that can be
specified zero, one or more times.
- UNDERLINED
typeface indicates the actual text and non-structural
codes (the data).
Sample:
<?xml version="1.0" encoding="iso-8859-1" ?>
<opentag version="1.2"
         xmlns="urn:OpenTag:Version12">
  <!-- First file, from a Java property file. It contains several locales. -->
  <file
    lc="EN-US"
    tool="Java_OTF:1.01-004:Java"
    datatype="Java"
    original="//brazil/recife/devile/data/app.pro"
    reference="//brazil/recife/devile/data/extract/app.skl">
    <grp rid="id_DLG_STATUS" type="label">
      <grp id="IDC_ACTIVITY" coord="8;72;54;10">
        <p lc="EN-US">&amp;Activity</p>
        <!-- Tool-specific data, e.g. in this case leverage information -->
        <p lc="FR-FR" ts="100%,Gandalf3.tm">&amp;Activit&#x00e9;</p>
      </grp>
    </grp>
    <!-- Example of a note generated by a filter -->
    <note>Extraction word count = 1</note>
  </file>
  <!-- Second file, this time from RTF. It contains only the text of the source language. -->
  <file
    lc="EN-US"
    tool="Borneo 1.00-017"
    datatype="RTF"
    original="//brazil/recife/devile/data/help.rtf">
    <!-- Definition for two user-defined characters. -->
    <csdef name="Latin1Cirth" base="ISO-8859-1">
      <note>For more information about the Cirth see
      the Web page http://www.indigo.ie/egt/standards/csur/cirth.html
      </note>
      <map code="130" ucode="" ent="noldorian_o"/>
      <map code="S" ucode="57558" ent="noldorian_oo"/>
    </csdef>
    <p id="1">This is a text in <g type="bold">bold</g>
    and <g id="1">all caps red</g></p>
    <p id="2">Second paragraph with graphic <x id="1"/>.</p>
  </file>
</opentag>
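To illustrate how such a file could be read, the Python sketch below walks a shortened version of the sample with the standard xml.etree.ElementTree module. The shortened file paths, the summarize function, and the escaped ampersands are assumptions made for this example. The namespace URI is the one declared on the <opentag> element.

```python
import xml.etree.ElementTree as ET

OT_NS = "urn:OpenTag:Version12"

# Shortened, hypothetical variant of the sample (paths are placeholders).
doc = """<opentag version="1.2" xmlns="urn:OpenTag:Version12">
  <file lc="EN-US" datatype="Java" original="app.pro" reference="app.skl">
    <grp id="IDC_ACTIVITY">
      <p lc="EN-US">&amp;Activity</p>
      <p lc="FR-FR">&amp;Activit&#x00e9;</p>
    </grp>
  </file>
  <file lc="EN-US" datatype="RTF" original="help.rtf">
    <p id="1">This is a text in <g type="bold">bold</g></p>
  </file>
</opentag>"""


def summarize(xml_text):
    """Return (datatype, paragraph texts) for each <file> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for file_elem in root.findall("{%s}file" % OT_NS):
        # itertext() flattens in-line elements such as <g> into plain text.
        texts = ["".join(p.itertext()) for p in file_elem.iter("{%s}p" % OT_NS)]
        rows.append((file_elem.get("datatype"), texts))
    return rows


for datatype, texts in summarize(doc):
    print(datatype, texts)
```

Because the elements live in the OpenTag namespace, every element lookup must use the qualified name. Attributes such as datatype are unqualified and can be read directly.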
From 1.1b to 1.2 (Nov-06-1998)
- The element <csa>
has been removed.
- The attribute cs
has been removed in all elements.
- The optional
attribute reference has been added in the element <file>.
- The element <tx>
has been added in the element <lvl>.
- The content of the
element <lvl> has been modified to include only a
<tx> element and an optional <so> element.
- The optional
attribute lc has been added in the elements <g> and
<mrk>.
From 1.1 to 1.1b (Apr-22-1998)
- The optional
attribute rid has been added to the <g> and <x/> elements.
From 1.0 to 1.1 (Mar-18-1998)
- Element and
attribute names must be in lowercase (XML being case
sensitive).
- The optional
attribute type has been added to the <x/> element.
- The mandatory
attribute id in the <x/> element is now optional.
- The optional
attribute ts has been added to the <x/> and <g> elements.
- The elements <pb/>
and <lb/> have been removed. In-line breaks should
now be set as <x/> elements with the
appropriate type attribute.
- The <rfd>
and <fnd> elements have been removed. They are
replaced by normal <p> elements with the
appropriate type attribute.
- The <fn/>
element has been removed. It has been replaced by a
normal <rf/> element with the
appropriate type attribute.
- The optional
attribute type has been added to the <rf/> element.
- The attribute name
of the <prop> element has been
renamed type.
OpenTag provisions from other
publications:
- ISO 639:1988
- Code for the
representation of names of languages.
See http://www.unicode.org/unicode/onlinedat/languages.html
- ISO 3166:1993
- Code for the
representation of names of countries.
See http://www.unicode.org/unicode/onlinedat/countries.html
- ISO 646:1991
- Information
Technology -- ISO 7-bit coded character set for
information interchange (ASCII).
- ISO 8601:1988
- Data elements and
interchange formats - Information interchange -
Representation of dates and times.
- ISO 8879:1986
- Information
Processing - Text and Office Systems - Standard
Generalized Markup Language.
See http://www.sgmlopen.org/
- ISO 10646-1:1993
- Information
Technology - Universal Multiple-Octet Coded Character Set
(UCS) - All parts.
See http://www.unicode.org/
- IANA Code set
names
- Code set naming
conventions.
See ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
- Extensible
Markup Language
- Extensible Markup
Language specifications.
See http://www.w3.org/TR/REC-xml