SGML: OpenTag XML Conformance

From w3c-sgml-wg-request@w3.org Wed Mar  5 23:17:52 1997
Date: Thu, 06 Mar 1997 00:14:02 -0500
Resent-from: w3c-sgml-wg@w3.org
From: Harvey Bingham <hbingham@acm.org>
Subject: Re: Kind of surprising commercial XML application (long)
   -----------------------------------------------------------------

> http://www.ile.com/opentag/
>
>I don't understand the application domain enough to really be fully 
>in tune with all the implications, but it's an interesting read. - Tim

Thanks, Tim for the pointer.

Executive Summary: The primary connection to this wg is their marketing 
hype, claiming they do XML. The novel idea for natural language
processing is that text zones can be extracted from a word-processing
proprietary format and tagged to the opentag DTD, manipulated (natural
language translation of those text zones), and then restored to the
places where the original text zones were, yielding the same formatting
for a new document in the different natural language. Custom
translators to and from the proprietary format are needed for the
25 elements of the OpenTag DTD.

----
1. Analysis:

OpenTag focuses on localization (natural langage processing, NLP) of text
in documents in any proprietary formatting text processor. It uses its
standard DTD that will support open data encoding methods during 
the localization process. It provides a 25-element DTD that assertedly
preserves the significant clues useful for producing natural language
translation of each text object. Presumably translation is easier using 
the opentag markup. When the translation is complete, the reverse process
occurs: it substitutes the translated text objects back into the 
corresponding places in the original formatted document, so all the
original formatting markup is reused. The SGML version is only the 
intermediary. It need not persist beyond the translation process.

They assert "XML compliant (and therefore SGML compliant as well)". They
aren't there yet. [See Eve's message as well.]

The OpenTag DTD preserves a few styles that might provide aids to
translation [but not font, face, pointsize information, that might
convey hierarchy or captioning]. Presumably a human translator
would have the original formatted document to work with that could
convey this other information.

A tagged instance has sequential numeric ids assigned to 
paragraphs in a document and again ordered sequential numeric ids to 
each inline textual object in each paragraph. As the content model 
includes tags for inlines in a paragraph, the mixed content #PCDATA 
parts would remain in the same order separated by injections of inlines.
A generic x in-line code tag is available, that might be used to
surround #PCDATA.

It handles footnotes, index terms, and bibliographic references. 
It preserves column breaks and table cell breaks. 

They allow (character) codeset change within a document from
that specified in an initial PI.
 
Comparing different localizations for the same formatting language X
of the same original would all share the ordered ids. 

Transformation to a different target formatting language Y, such as RTF,
HTML, Wordperfect, etc. would not generally be possible, unless there
were a richer way to assign corresponding style recognizers and generators.

Some formatting languages such as MicroSoft Powerpoint scramble object
order. There is no implicit left-to-right, top-down text progression
as is true for western languages in most word-processing format languages.
(Even in those, table cell line-wrapping can break that model.)

I believe the opentag document instances will lose much information 
in the translation process of mapping everything onto their 25 generic
elements, nine of which are EMPTY. The consequence is that they will 
not be able to generally go from X to different formatting language Y.

2. The DTD -- opentag.dtd

The OpenTag DTD claims conformance with
"XML, Extensible Markup Language, working draft 11-14-96".

The assertion of compatiblity with XML is commendable, but overstated.

Omissible tag indicators should not be present. Or, if they are,
the omissible endtag on a content model is improper:

<!ELEMENT P - O ...> is inconsistent with XML.

All hierarchic structures are flattened into a sequence of paragraphs P.

Self-exclusion exceptions occur, on b,i,u,d bold, italics, underline,
doubleunderline) indicate a style attribute on something -- but why stop
with those few? smallcaps, overbar, strikethrough, subscript, superscript, etc.]

<!ELEMENT so - - RCDATA>

RCDATA is disallowed in XML.

There is no element declaration for level, used in the ixd content
model. Instead there is a lvl element, with pernicious mixed
content, ((%paratxt;)*,so) that is illegal in XML.

The PB tag semantics refer to an otherwise undefined definition group.


3. The OpenTag Format Specifications -- otspecs.htm

This includes a markup reference to XML, character entities (either
SGML &name;, or Unicode &#xxxx;), a file header, and a tag description
table, providing some semantics. 
An initial PI file header describes version, type, encoding, locale, 
and original. This suggests document==file. There is no mention of
external entities.

No attributes are on the base document element, opentag.
Almost universal is an id NUMBER attribute. Few other attributes exist.
The element type distinguishes use: target or reference for the id value.

The example for paragraph (which has a content model) "<p id=1/>" is 
inconsistent with SGML. It uses the XML empty tag variant "/>" in 
place of tagc, not now present in SGML.

I am confused by the use of id as a numeric (not ID) attribute. It is
appropriate for the textual definitions of footnotes,
index terms, and (biographic) references -- FND, IXD and
RFD, but also in their EMPTY points of reference in FN, IX, and RF. 
As the latter are actually references, in SGML I believe it would be 
better with a distinguishing attribute name, possibly "idr", rather than
overload id. I see no reason why the "idr" need to be unique in the
file. Several references to the same object should be allowed.
[The end of the RFD semantics seems in error, referring to footnote
definition, rather than reference definition.]

The use of sequential integers starting from 1 at each <p> for each 
in-line makes any use of the id values during language processing
awkward. Since each paragraph has its own sequence, presumably the 
paragraph ids, unique to the document, also a set of sequential numbers
starting at 1 as qualifiers to disambiguate the ids of the in-line children.
But since the inline id values are reused, they are not 
sufficientby themselves to locate any object, other than indirectly 
through its ancestors.

The generic inline group can be arbitrarily moved in a P. That seems to
means that all id values in that P need to be adjusted after such 
movement. [Any editing to add, reorder, or delete objects in a P has
the same problem.] Perhaps this is ruled out in the localization NLP.

Revisions across a set of localized versions of a document could use
opentag versions to linearize the structure and provide some degree
of common navigation to common places. There are no attributes that
would indicate any revision control.

>From this cursory look, I believe OpenTag still needs significant work.

Regards/Harvey Bingham
mailto:hbingham@acm.org

 ----------------------------------------------------------------------

From w3c-sgml-wg-request@w3.org Thu Mar  6 00:31:55 1997
Date: Wed, 05 Mar 1997 22:29:44 -0800
From: bosak@atlantic-83.Eng.Sun.COM (Jon Bosak)
Subject: Bogus XML conformance claim


As the chairman of the committee developing XML, I'd like to thank you
for your attempts at XML compliance.  However, analysis shows that the
claim to conformance with WD-xml-961114 as currently stated in
opentag.dtd Draft version 3 is false (see below).  Until you make the
changes needed to bring OpenTag into compliance with the XML
specification, I strongly suggest that you drop the claim, as it can
only reflect badly on your expertise and credibility.

I welcome your participation in the XML effort and hope that you will
soon be able accurately to claim XML compliance for OpenTag.

Sincerely,

Jon

----------------------------------------------------------------------
 Jon Bosak, Online Information Technology Architect, Sun Microsystems
----------------------------------------------------------------------
 2550 Garcia Ave., MPK17-101,           |  Best is he that inuents,
 Mountain View, California 94043        |  the next he that followes
 Davenport Group::SGML Open::ANSI X3V1  |  forth and eekes out a good
 ::ISO/IEC JTC1/SC18/WG8::W3C SGML ERB  |  inuention.
----------------------------------------------------------------------



 Date: Wed, 05 Mar 1997 11:45:10 -0500
 To: w3c-sgml-wg@w3.org
 From: "Eve L. Maler" <elm@arbortext.com>
 Subject: Re: Kind of surprising commercial XML application

 In case anyone is curious, here are the ways in which the opentag.dtd isn't
 XML compliant (some of the ways are trivial, some not):

 - #PCDATA must come at the beginning of a content model (%paratxt;, u, ...)
 - #PCDATA must always use * occurrence indicator (u, ...)
 - #PCDATA can't be in a model with pernicious mixed content (lvl)
 - Exceptions not allowed (b, ...)
 - OMITTAG specifications not (yet :-) allowed
 - NUMBER declared value not allowed (p, ...)
 - Comments not <!--* *--> (but this isn't in the spec version they used)

 Designing and implementing DTDs according to the XML constraints is a new
 discipline...

	 Eve



 Date: Thu, 06 Mar 1997 00:14:02 -0500
 To: w3c-sgml-wg@w3.org
 From: Harvey Bingham <hbingham@ACM.org>
 Subject: Re: Kind of surprising commercial XML application (long)

 Executive Summary: The primary connection to this wg is their marketing 
 hype, claiming they do XML. The novel idea for natural language
 processing is that text zones can be extracted from a word-processing
 proprietary format and tagged to the opentag DTD, manipulated (natural
 language translation of those text zones), and then restored to the
 places where the original text zones were, yielding the same formatting
 for a new document in the different natural language. Custom
 translators to and from the proprietary format are needed for the
 25 elements of the OpenTag DTD.

 ----
 1. Analysis:

 OpenTag focuses on localization (natural langage processing, NLP) of text
 in documents in any proprietary formatting text processor. It uses its
 standard DTD that will support open data encoding methods during 
 the localization process. It provides a 25-element DTD that assertedly
 preserves the significant clues useful for producing natural language
 translation of each text object. Presumably translation is easier using 
 the opentag markup. When the translation is complete, the reverse process
 occurs: it substitutes the translated text objects back into the 
 corresponding places in the original formatted document, so all the
 original formatting markup is reused. The SGML version is only the 
 intermediary. It need not persist beyond the translation process.

 They assert "XML compliant (and therefore SGML compliant as well)". They
 aren't there yet. [See Eve's message as well.]

 The OpenTag DTD preserves a few styles that might provide aids to
 translation [but not font, face, pointsize information, that might
 convey hierarchy or captioning]. Presumably a human translator
 would have the original formatted document to work with that could
 convey this other information.

 A tagged instance has sequential numeric ids assigned to 
 paragraphs in a document and again ordered sequential numeric ids to 
 each inline textual object in each paragraph. As the content model 
 includes tags for inlines in a paragraph, the mixed content #PCDATA 
 parts would remain in the same order separated by injections of inlines.
 A generic x in-line code tag is available, that might be used to
 surround #PCDATA.

 It handles footnotes, index terms, and bibliographic references. 
 It preserves column breaks and table cell breaks. 

 They allow (character) codeset change within a document from
 that specified in an initial PI.

 Comparing different localizations for the same formatting language X
 of the same original would all share the ordered ids. 

 Transformation to a different target formatting language Y, such as RTF,
 HTML, Wordperfect, etc. would not generally be possible, unless there
 were a richer way to assign corresponding style recognizers and generators.

 Some formatting languages such as MicroSoft Powerpoint scramble object
 order. There is no implicit left-to-right, top-down text progression
 as is true for western languages in most word-processing format languages.
 (Even in those, table cell line-wrapping can break that model.)

 I believe the opentag document instances will lose much information 
 in the translation process of mapping everything onto their 25 generic
 elements, nine of which are EMPTY. The consequence is that they will 
 not be able to generally go from X to different formatting language Y.

 2. The DTD -- opentag.dtd

 The OpenTag DTD claims conformance with
 "XML, Extensible Markup Language, working draft 11-14-96".

 The assertion of compatiblity with XML is commendable, but overstated.

 Omissible tag indicators should not be present. Or, if they are,
 the omissible endtag on a content model is improper:

 <!ELEMENT P - O ...> is inconsistent with XML.

 All hierarchic structures are flattened into a sequence of paragraphs P.

 Self-exclusion exceptions occur, on b,i,u,d bold, italics, underline,
 doubleunderline) indicate a style attribute on something -- but why stop
 with those few? smallcaps, overbar, strikethrough, subscript, superscript, etc.]

 <!ELEMENT so - - RCDATA>

 RCDATA is disallowed in XML.

 There is no element declaration for level, used in the ixd content
 model. Instead there is a lvl element, with pernicious mixed
 content, ((%paratxt;)*,so) that is illegal in XML.

 The PB tag semantics refer to an otherwise undefined definition group.


 3. The OpenTag Format Specifications -- otspecs.htm

 This includes a markup reference to XML, character entities (either
 SGML &name;, or Unicode &#xxxx;), a file header, and a tag description
 table, providing some semantics. 
 An initial PI file header describes version, type, encoding, locale, 
 and original. This suggests document==file. There is no mention of
 external entities.

 No attributes are on the base document element, opentag.
 Almost universal is an id NUMBER attribute. Few other attributes exist.
 The element type distinguishes use: target or reference for the id value.

 The example for paragraph (which has a content model) "<p id=1/>" is 
 inconsistent with SGML. It uses the XML empty tag variant "/>" in 
 place of tagc, not now present in SGML.

 I am confused by the use of id as a numeric (not ID) attribute. It is
 appropriate for the textual definitions of footnotes,
 index terms, and (biographic) references -- FND, IXD and
 RFD, but also in their EMPTY points of reference in FN, IX, and RF. 
 As the latter are actually references, in SGML I believe it would be 
 better with a distinguishing attribute name, possibly "idr", rather than
 overload id. I see no reason why the "idr" need to be unique in the
 file. Several references to the same object should be allowed.
 [The end of the RFD semantics seems in error, referring to footnote
 definition, rather than reference definition.]

 The use of sequential integers starting from 1 at each <p> for each 
 in-line makes any use of the id values during language processing
 awkward. Since each paragraph has its own sequence, presumably the 
 paragraph ids, unique to the document, also a set of sequential numbers
 starting at 1 as qualifiers to disambiguate the ids of the in-line children.
 But since the inline id values are reused, they are not 
 sufficientby themselves to locate any object, other than indirectly 
 through its ancestors.

 The generic inline group can be arbitrarily moved in a P. That seems to
 means that all id values in that P need to be adjusted after such 
 movement. [Any editing to add, reorder, or delete objects in a P has
 the same problem.] Perhaps this is ruled out in the localization NLP.

 Revisions across a set of localized versions of a document could use
 opentag versions to linearize the structure and provide some degree
 of common navigation to common places. There are no attributes that
 would indicate any revision control.

 From this cursory look, I believe OpenTag still needs significant work.

 Regards/Harvey Bingham
 mailto:hbingham@acm.org