[Mirrored from: http://www.ornl.gov/sgml/wg8/document/1875.htm]
ISO/IEC JTC 1/SC 18
Document Processing and Related Communication --
Document Description and Processing Languages
Contribution on SGML Review
Charles F. Goldfarb
National Body Contribution
11 November 1996
WG8 and Liaisons
TO: Dr. James David Mason
(ISO/IEC JTC1/SC18/WG8 Convenor)
Lockheed Martin Energy Systems
Information Management Services
Commerce Park, M.S. 6480
Oak Ridge, TN 37831-6480 U.S.A.
+1 423 574-6973
Facsimile: +1 423 574-0004
At last month's meeting, the US national body drafted the following
submission. One issue for further discussion is whether some of the new controls
it proposes should be declared in the SGML declaration, the DTD, or both.
Another is, if the proposal in 3c below is accepted, whether the cro delimiter
should be recognized (but erroneous) within parameter literals in the concrete
syntax parameter of the SGML declaration.
The US national body recommends that the following be considered during the
ongoing review of ISO 8879:
- During the expected lifetime of a modified SGML standard, a growing number
of documents will be created with the aid of SGML-aware editors and other
software. Therefore, WG8 should minimize the time it spends, and the time that a
modified standard requires implementors of parsers to spend, on details that are
primarily relevant to an environment where users enter SGML markup directly.
- SGML documents must remain viewable and readable by human beings.
- Automatically generated SGML documents should not be favored to the extent
that manually edited documents become difficult to create. For example, the
current standard ignores the RE at the end of a record that contains only an
included subelement but in general does not ignore the RE at the end of a record
that contains only proper subelements. Users of an SGML text editor that
automatically controls use of REs may not be aware of such distinctions.
Therefore, they may create DTDs where they do not consider REs when deciding
whether to use inclusions. After an extensive project, which includes thorough
testing with the SGML-aware editor, the DTD may cause confusion if it is
eventually distributed to users who create documents where REs are manually
entered.
- The following items pertain to SGML record handling:
- A parser ignores the first RE in an element and the last RE in an element
unless an RS, data, or proper subelement occurs between the RE and the start or
end of the element. The current standard treats ignored REs in mixed content as
data and hence requires them to satisfy a #PCDATA token. Therefore, document
fragments such as
<!ELEMENT a - - (b, #PCDATA)>
...
<a>
<b>
are erroneous because the RE after the <a>, even though it would
be ignored if legal, is data that appears in a context where data is not
permitted. The recommended change is to determine whether to ignore an RE before
matching content to the relevant model group. If an RE is ignored, it will not
be matched to the content model.
- Currently, separator characters that occur in mixed content are always
data. Instead, such characters should be ignored if they occur in a context
where data is not permitted. Thus, the currently erroneous
<!ELEMENT a - - (b, #PCDATA)> ... <a> <b>...
(which is nonconforming because of the space between the two
start-tags), would become conforming with that space ignored. However, the
<!ELEMENT a - - (b|#PCDATA)> ... <a> <b>...
would remain in error, because the space would determine that the
#PCDATA branch of the or-group was taken.
- Treatment of REs other than at the start or end of an element should be
decoupled from the use of inclusions. An element is said to be RE preserving if
the RE at the end of a record containing only such elements is not ignored. In
the current standard, proper subelements are RE preserving and inclusions are
not. A new declaration should be added to allow the document type declaration to
control whether an element is RE preserving:
preserve RE = mdo, "REPRES", ps+, (element type | (rni, ("ELEMENT" | "PI" | "COMMENT"))), ps+,
("INCLUDE" | "PROPER")?, ps+, ("YES" | "NO")?, ps*, mdc
In clause 7.6.1, "An RE that does not immediately follow an RS or
RE is ignored if no data or proper subelement intervened" should be changed
to "An RE that does not immediately follow an RS or RE is ignored if no
data or RE preserving subelement intervened." Also, "An RE is deemed
to occur ... after any ... included subelement" should be changed to "An
RE is deemed to occur ... after any ... non-RE preserving subelement." A
generic identifier appears in the element type to indicate that the declaration
applies to elements of the specified type. #ELEMENT indicates that the
declaration applies to all elements for which no explicit declaration is
present. #PI means it applies to processing instructions, #COMMENT to comments.
If specified, INCLUDE means the declaration applies only to inclusions of one of
the specified types; PROPER means it applies only to proper subelements. INCLUDE
and PROPER cannot be specified for #PI or #COMMENT. YES means the element,
processing instruction, or comment is RE preserving; NO means it is not.
The declarations
<!REPRES #ELEMENT INCLUDE NO>
<!REPRES #ELEMENT PROPER YES>
<!REPRES #PI NO>
<!REPRES #COMMENT NO>
produce the same behavior as the current standard.
- SGML applications frequently choose to ignore RE characters near the start
or end of an element even when these characters are passed to them by an SGML
parser as significant data. Providing more control of such REs in a document
type declaration would simplify application development and help ensure
consistency among multiple implementations of an application. The following
declaration provides such control:
record break = mdo, "RECORDBK", ps+, (element type | (rni, "ELEMENT")), ps+,
("PROCESS" | "IGNORE"), ps+, ("PROCESS" | "IGNORE"), ps+,
("PROCESS" | "IGNORE"), ps+, ("PROCESS" | "IGNORE"), ps*, mdc
The declaration pertains to all elements whose generic identifier is
listed in the element type; if #ELEMENT is specified it applies to all elements.
The four PROCESS or IGNORE keywords pertain in order to the last RE preceding an
element, first RE in the element, last RE in the element, and first RE following
an element provided no data or RE-preserving element intervened. Current
behavior is achieved with
<!RECORDBK #ELEMENT PROCESS IGNORE IGNORE PROCESS>
- RS is currently ignored when it is not used as markup. A new parameter in
the SGML declaration should allow a list of characters to be specified that will
be ignored when not used as markup. When RS does not appear in the list, RS
characters will be ignored under control of the record break declaration
- A new feature is needed to disable use of RE as a reference close.
- A concrete syntax currently must assign a character number to RE and RS.
The function character identification parameter of the SGML declaration should
be extended to allow the reserved name "SYSTEM" to be specified
instead of a number. "SYSTEM" means that RE or RS becomes a
pseudo-character or signal similar to Ee and need not appear in a storage object
as a separate character. For example, a system might choose to infer an RS at
the beginning of an SGML document. Provision for such signals should be added to
the specification for formal system identifiers.
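The proposed record break declaration can be illustrated with a minimal Python sketch. It is not part of the proposal: the function names, the position labels, and the regular expression are hypothetical, and it assumes a single generic identifier (no name groups), the reference concrete syntax delimiters, and case-insensitive names.

```python
import re
from typing import Dict

# Hypothetical labels for the four RE positions, in the order the
# proposal lists them: last RE preceding the element, first RE in the
# element, last RE in the element, first RE following the element.
POSITIONS = ("before", "first_inside", "last_inside", "after")

# Assumption: one generic identifier or #ELEMENT, reference delimiters.
RECORDBK = re.compile(
    r"<!RECORDBK\s+(#ELEMENT|[A-Za-z][A-Za-z0-9.-]*)"
    r"\s+(PROCESS|IGNORE)\s+(PROCESS|IGNORE)"
    r"\s+(PROCESS|IGNORE)\s+(PROCESS|IGNORE)\s*>"
)


def parse_recordbk(declarations: str) -> Dict[str, Dict[str, bool]]:
    """Map each target (GI or '#ELEMENT') to {position: RE is ignored}."""
    table: Dict[str, Dict[str, bool]] = {}
    for match in RECORDBK.finditer(declarations):
        target, *keywords = match.groups()
        # Assumption: names are folded to upper case for comparison.
        table[target.upper()] = {
            pos: kw == "IGNORE" for pos, kw in zip(POSITIONS, keywords)
        }
    return table


def re_is_ignored(table: Dict[str, Dict[str, bool]], gi: str, position: str) -> bool:
    """Look up the rule for an element type, falling back to #ELEMENT."""
    rule = table.get(gi.upper()) or table.get("#ELEMENT")
    if rule is None:
        raise KeyError(f"no RECORDBK declaration applies to {gi!r}")
    return rule[position]
```

With the declaration that reproduces current behavior, `re_is_ignored(parse_recordbk("<!RECORDBK #ELEMENT PROCESS IGNORE IGNORE PROCESS>"), "p", "first_inside")` returns `True`, while the same query for `"after"` returns `False`.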
- The following items pertain to character sets:
- A numeric character reference with a number greater than the code set size
of the document character set is invalid (the code set size is 2^n-1 where n is
the code set width as defined in WG8/N1855).
- Does the current standard support 6- and 7-bit character sets? If so, what
are backwards compatibility issues for the proposed definition of code set size?
In fact, does definition 4.43 or some other part of the standard imply that all
bit combinations of the chosen size be included in the code set or merely a
consecutive subset thereof? These questions become irrelevant by dropping the
requirement to specify each character number exactly once in the described
character set portions under the following interpretation. There are 3 classes
of characters: those that can be entered directly or through character
references, those that can be entered only through character references, and
those that are prohibited. The first class is declared with a minimum literal or
base set character number, the second by "UNUSED", and the third by
omitting the character number from the described character set portions. In
addition to clarifying the interpretation of the character set declaration, this
approach provides users with a way to prohibit references to undefined
characters.
- To increase human readability of the SGML declaration, characters that
occur directly within parameter literals in the concrete syntax parameter of the
SGML declaration, such as those that appear in naming rules, should be
interpreted in the document character set. To ensure that such characters also
are available in the syntax-reference character set (as they must if they are
required in the concrete syntax), characters that appear directly must be
minimum data characters. Character numbers used in numeric character references,
however, must refer to the syntax-reference character set. To emphasize that a
parameter literal in the concrete syntax is interpreted differently from an
identical parameter literal that appears elsewhere in the document, a different
delimiter is used to introduce character references. In particular, instead of
the cro, the srcro (syntax-reference character reference open) is used. This
delimiter can never be changed because it is used only in the SGML declaration.
It is "%#".
- To increase the convenience of entering character numbers, in particular
those to code points in large character sets, the syntax of character numbers is
extended to allow constructs such as
D#412 95/612 O#777 H#abcd/ffff
Here, D, O, and H respectively indicate that the following number is
decimal, octal, or hexadecimal. / separates bytes: they are assumed to be 8-bit
bytes, but a new parameter in the SGML declaration can specify the byte size.
The # and / characters are new delimiters (cns and cnd for character number
start and character number delimiter) recognized only within character numbers.
A numeral within a character reference is terminated by a character not allowed
within numerals of the appropriate base: 8 or 9 ends an octal numeral, and any
letter not permitted in a hexadecimal numeral terminates such a numeral.
One use of character numbers is, of course, in character references. The
following character references parallel the above examples:
&#D#412; &#95/612; &#O#777; &#H#abcd/ffff;
- A new declaration
auxiliary character set = mdo, "CHARSET", ps+, name, ps+, external identifier, ps+, number, (ps+, number)?, ps*, mdc
can appear in a document type declaration. This declaration indicates
that the specified name identifies a character set to be used in character
references. The first number is the width of this additional character set and
the second number, if present, is the byte size. This declaration enables
character references such as
&#/cs/365; &#/cs/charname; &#/cs/H#abcd/8b7d;
where / is the cnd defined above, "cs" is the name of a
declared auxiliary character set and "charname" is a name defined in
the character set. The parser passes the external identifier of the character
set and the name or number to the application; it is up to the application to
determine if the name is meaningful.
- For backwards compatibility, the extensions to character references
discussed above are enabled by a new optional feature (otherwise, a document
containing a string such as "&#231/455" is interpreted differently
under the current and changed standards).
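The extended character number syntax proposed above (an optional D, O, or H base letter followed by the cns "#", then numerals separated by the cnd "/") can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, a missing base prefix is assumed to mean decimal, hexadecimal digits are accepted in either case, and the parts are returned separately because the proposal does not spell out how they combine into a single code point.

```python
from typing import Tuple

# Base letters from the proposal: D = decimal, O = octal, H = hexadecimal.
BASES = {"D": 10, "O": 8, "H": 16}

# Digits permitted in a numeral of each base (either case for hex is an
# assumption of this sketch).
DIGITS = {
    8: "01234567",
    10: "0123456789",
    16: "0123456789abcdefABCDEF",
}


def parse_character_number(text: str) -> Tuple[int, ...]:
    """Parse e.g. 'H#abcd/ffff' into one integer per '/'-separated part."""
    base = 10  # assumption: no prefix means decimal, as in '95/612'
    body = text
    if len(text) >= 2 and text[1] == "#":
        prefix = text[0]
        if prefix not in BASES:
            raise ValueError(f"unknown base prefix {prefix!r}")
        base = BASES[prefix]
        body = text[2:]
    values = []
    for part in body.split("/"):
        # Check digits explicitly; int() alone would also accept signs,
        # surrounding whitespace, and underscores.
        if not part or any(ch not in DIGITS[base] for ch in part):
            raise ValueError(f"invalid base-{base} numeral {part!r}")
        values.append(int(part, base))
    return tuple(values)
```

For the examples in the text, `parse_character_number("O#777")` returns `(511,)` and `parse_character_number("H#abcd/ffff")` returns `(43981, 65535)`; a string such as `"O#78"` raises `ValueError` because 8 is not an octal digit.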
Lynne A. Price
Text Structure Consulting