[Mirrored from: http://www.ornl.gov/sgml/wg8/document/1875.htm]

WG8 N1875


Document Processing and Related Communication --

Document Description and Processing Languages

TITLE: U.S. Contribution on SGML Review
PROJECT: JTC1.18.15.1
PROJECT EDITOR: Charles F. Goldfarb
STATUS: National Body Contribution
ACTION: For information
DATE: 11 November 1996
DISTRIBUTION: WG8 and Liaisons
REPLY TO: Dr. James David Mason
(ISO/IEC JTC1/SC18/WG8 Convenor)
Lockheed Martin Energy Systems
Information Management Services
1060 Commerce Park, M.S. 6480
Oak Ridge, TN 37831-6480 U.S.A.
Telephone: +1 423 574-6973
Facsimile: +1 423 574-0004
Network: masonjd@ornl.gov

U.S. Contribution

At last month's meeting, the US national body drafted the following submission. One issue for further discussion is whether some of the new controls it proposes should be declared in the SGML declaration, the DTD, or both. Another is whether, if the proposal in 3c (item 3, subitem 3) below is accepted, the cro delimiter should be recognized (but erroneous) within parameter literals in the concrete syntax parameter of the SGML declaration.

The US national body recommends that the following be considered during the ongoing review of ISO 8879:

  1. During the expected lifetime of a modified SGML standard, a growing number of documents will be created with the aid of SGML-aware editors and other software. Therefore, WG8 should minimize the time it spends, and the time that a modified standard requires implementors of parsers to spend, on details that are primarily relevant to an environment where users enter SGML markup directly. However, remember:
    1. SGML documents must remain viewable and readable by human beings.
    2. Automatically generated SGML documents should not be favored to the extent that manually edited documents become difficult to create. For example, the current standard ignores the RE at the end of a record that contains only an included subelement but in general does not ignore the RE at the end of a record that contains only proper subelements. Users of an SGML text editor that automatically controls use of REs may not be aware of such distinctions. Therefore, they may create DTDs where they do not consider REs when deciding whether to use inclusions. After an extensive project, which includes thorough testing with the SGML-aware editor, the DTD may cause confusion if it is eventually distributed to users who create documents where REs are manually inserted.
  2. The following items pertain to SGML record handling:
    1. A parser ignores the first RE in an element and the last RE in an element unless an RS, data, or proper subelement occurs between the RE and the start or end of the element. The current standard treats ignored REs in mixed content as data and hence requires them to satisfy a #PCDATA token. Therefore, document fragments such as
       <!ELEMENT a - - (b, #PCDATA)> ... <a>
       <b>
      are erroneous because the RE after the <a>, even though it would be ignored if legal, is data that appears in a context where data is not permitted. The recommended change is to determine whether to ignore an RE before matching content to the relevant model group. If an RE is ignored, it will not be matched to the content model.
    2. Currently, separator characters that occur in mixed content are always data. Instead, such characters should be ignored if they occur in a context where data is not permitted. Thus, the currently erroneous
       <!ELEMENT a - - (b, #PCDATA)> ... <a> <b>...
      (which is nonconforming because of the space between the two start-tags) would become conforming, with that space ignored. However, the similar fragment
       <!ELEMENT a - - (b|#PCDATA)> ... <a> <b>...
      would remain in error, because the space would determine that the #PCDATA branch of the or-group was taken.
    3. Treatment of REs other than at the start or end of an element should be decoupled from the use of inclusions. An element is said to be RE preserving if the RE at the end of a record containing only such elements is not ignored. In the current standard, proper subelements are RE preserving and inclusions are not. A new declaration should be added to allow the document type declaration to control whether an element is RE preserving:
       preserve RE = mdo, "REPRES", ps+, (element type | (rni, ("ELEMENT" | "PI" | "COMMENT"))), ps+, ("INCLUDE" | "PROPER")?, ps+, ("YES" | "NO")?, ps*, mdc
      In clause 7.6.1, "An RE that does not immediately follow an RS or RE is ignored if no data or proper subelement intervened" should be changed to "An RE that does not immediately follow an RS or RE is ignored if no data or RE preserving subelement intervened." Also, "An RE is deemed to occur ... after any ... included subelement" should be changed to "An RE is deemed to occur ... after any ... non-RE preserving subelement." A generic identifier appears in the element type to indicate that the declaration applies to elements of the specified type. #ELEMENT indicates that the declaration applies to all elements for which no explicit declaration is present. #PI means it applies to processing instructions, #COMMENT to comments. If specified, INCLUDE means the declaration applies only to inclusions of one of the specified types; PROPER means it applies only to proper subelements. INCLUDE and PROPER cannot be specified for #PI or #COMMENT. YES means the element, processing instruction, or comment is RE preserving; NO means it is not.
      [produce] the same behavior as the current standard.
    4. SGML applications frequently choose to ignore RE characters near the start or end of an element even when these characters are passed to them by an SGML parser as significant data. Providing more control of such REs in a document type declaration would simplify application development and help ensure consistency among multiple implementations of an application. The following declaration provides such control:
       record break = mdo, "RECORDBK", ps+, (element type | (rni, "ELEMENT")), ps+, ("PROCESS" | "IGNORE"), ps+, ("PROCESS" | "IGNORE"), ps+, ("PROCESS" | "IGNORE"), ps+, ("PROCESS" | "IGNORE"), ps*, mdc
      The declaration pertains to all elements whose generic identifier is listed in the element type; if #ELEMENT is specified it applies to all elements. The four PROCESS or IGNORE keywords pertain in order to the last RE preceding an element, first RE in the element, last RE in the element, and first RE following an element provided no data or RE-preserving element intervened. Current behavior is achieved with
    5. RS is currently ignored when it is not used as markup. A new parameter in the SGML declaration should allow a list of characters to be specified that will be ignored when not used as markup. When RS does not appear in the list, RS characters will be ignored under control of the record break declaration discussed above.
    6. A new feature is needed to disable use of RE as a reference close.
    7. A concrete syntax currently must assign a character number to RE and RS. The function character identification parameter of the SGML declaration should be extended to allow the reserved name "SYSTEM" to be specified instead of a number. "SYSTEM" means that RE or RS becomes a pseudo-character or signal similar to Ee and need not appear in a storage object as a separate character. For example, a system might choose to infer an RS at the beginning of an SGML document. Provision for such signals should be added to the specification for formal system identifiers.
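
The effect of item 2.2 on its two example fragments can be sketched in Python. This is an illustrative sketch only, not part of the proposal: the tuple representation of a model group and the names first_tokens and classify_leading_separator are hypothetical, and the sketch assumes flat model groups with no occurrence indicators or nested groups.

```python
PCDATA = "#PCDATA"


def first_tokens(model):
    """Return the tokens that may begin the content of a flat model group.

    model is a (connector, tokens) pair: "," for a sequence group,
    "|" for an or-group. Occurrence indicators are not modelled.
    """
    connector, tokens = model
    if connector == ",":
        # In a sequence group only the first token can start the content.
        return {tokens[0]}
    if connector == "|":
        # In an or-group any branch can start the content.
        return set(tokens)
    raise ValueError("unsupported connector: %r" % (connector,))


def classify_leading_separator(model):
    """Classify a separator that appears at the very start of the content.

    Under the proposed rule it is ignored when data is not permitted
    there; otherwise it remains data.
    """
    return "data" if PCDATA in first_tokens(model) else "ignored"


# (b, #PCDATA): data is not permitted before <b>, so the space is ignored.
print(classify_leading_separator((",", ["b", PCDATA])))
# (b|#PCDATA): the space is data and selects the #PCDATA branch,
# so the <b> start-tag that follows is still an error.
print(classify_leading_separator(("|", ["b", PCDATA])))
```

A real parser would make this decision at every point in the content, not only at the start, and would have to account for optional tokens; the sketch covers only the two fragments shown in item 2.2.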
  3. The following items pertain to character sets:
    1. A numeric character reference with a number greater than the code set size of the document character set is invalid (the code set size is 2^n-1 where n is the code set width as defined in WG8/N1855).
    2. Does the current standard support 6- and 7-bit character sets? If so, what are the backwards compatibility issues for the proposed definition of code set size? In fact, does definition 4.43 or some other part of the standard imply that all bit combinations of the chosen size must be included in the code set, or merely a consecutive subset thereof? These questions become irrelevant if the requirement to specify each character number exactly once in the described character set portions is dropped and the following interpretation is adopted. There are 3 classes of characters: those that can be entered directly or through character references, those that can be entered only through character references, and those that are prohibited. The first class is declared with a minimum literal or base set character number, the second by "UNUSED", and the third by omitting the character number from the described character set portions. In addition to clarifying the interpretation of the character set declaration, this approach provides users with a way to prohibit references to undefined characters.
    3. To increase human readability of the SGML declaration, characters that occur directly within parameter literals in the concrete syntax parameter of the SGML declaration, such as those that appear in naming rules, should be interpreted in the document character set. To ensure that such characters also are available in the syntax-reference character set (as they must be if they are required in the concrete syntax), characters that appear directly must be minimum data characters. Character numbers used in numeric character references, however, must refer to the syntax-reference character set. To emphasize that a parameter literal in the concrete syntax is interpreted differently from an identical parameter literal that appears elsewhere in the document, a different delimiter is used to introduce character references. In particular, instead of the cro, the srcro (syntax-reference character reference open) is used. This delimiter can never be changed because it is used only in the SGML declaration. It is "%#".
    4. To increase the convenience of entering character numbers, in particular those that refer to code points in large character sets, the syntax of character numbers is extended to allow constructs such as
       D#412 95/612 O#777 H#abcd/ffff
      Here, D, O, and H indicate that the following number is decimal, octal, or hexadecimal, respectively. The / separates bytes, which are assumed to be 8-bit bytes unless a new parameter in the SGML declaration specifies another byte size. The # and / characters are new delimiters (cns and cnd, for character number start and character number delimiter) recognized only within character numbers. A numeral within a character reference is terminated by a character not allowed within numerals of the appropriate base: an 8 or 9 ends an octal numeral, and any letter not permitted in a hexadecimal numeral terminates such a numeral.
      One use of character numbers is, of course, in character references. The following character references parallel the above examples:
       &#D#412; &#95/612; &#O#777; &#H#abcd/ffff;
    5. A new declaration
       auxiliary character set = mdo, "CHARSET", ps+, name, ps+, external identifier, ps+, number, (ps+, number)?, ps*, mdc
      can appear in a document type declaration. This declaration indicates that the specified name identifies a character set to be used in character references. The first number is the width of this additional character set and the second number, if present, is the byte size. This declaration enables character references such as
       &#/cs/365; &#/cs/charname; &#/cs/H#abcd/8b7d;
      where / is the cnd defined above, "cs" is the name of a declared auxiliary character set and "charname" is a name defined in the character set. The parser passes the external identifier of the character set and the name or number to the application; it is up to the application to determine if the name is meaningful.
    6. For backwards compatibility, the extensions to character references discussed above are enabled by a new optional feature (otherwise, a document containing a string such as "&#231/455" is interpreted differently under the current and changed standards).
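
The extended character numbers of item 3.4 and the range check of item 3.1 can be sketched in Python as follows. This is a hedged illustration rather than a specification. The function names are hypothetical. The sketch assumes that bytes separated by / are combined most significant first, that each byte of a multi-byte number must fit in the declared byte size, and that a number with no / is taken as a whole value. None of these points is settled by the text above. The sketch also parses a character number that has already been isolated, so a character outside the base is rejected instead of terminating the numeral.

```python
DIGITS = "0123456789abcdef"
BASES = {"D": 10, "O": 8, "H": 16}  # base letters from item 3.4


def _numeral(text, base):
    """Convert one numeral, allowing only digits valid in the given base."""
    allowed = DIGITS[:base]
    if not text or any(ch not in allowed for ch in text.lower()):
        raise ValueError("invalid base-%d numeral: %r" % (base, text))
    return int(text, base)


def parse_character_number(text, byte_size=8):
    """Parse a number such as 'D#412', '95/612', 'O#777' or 'H#abcd/ffff'."""
    base, body = 10, text  # decimal when no base letter is given
    # Optional base letter followed by the cns delimiter '#'.
    if len(text) >= 2 and text[0] in BASES and text[1] == "#":
        base, body = BASES[text[0]], text[2:]
    # The cnd delimiter '/' separates bytes.
    values = [_numeral(part, base) for part in body.split("/")]
    if len(values) > 1:
        for value in values:
            if value >= (1 << byte_size):
                raise ValueError("byte %d exceeds %d bits" % (value, byte_size))
    result = 0
    for value in values:
        # Assumption: bytes are concatenated most significant first.
        result = (result << byte_size) | value
    return result


def exceeds_code_set(number, width):
    """Item 3.1: a reference above the code set size 2^n - 1 is invalid."""
    return number > (1 << width) - 1
```

Under these assumptions, the document's multi-byte examples 95/612 and H#abcd/ffff parse only when the byte size is larger than the default of 8, for example parse_character_number("H#abcd/ffff", byte_size=16).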

Lynne A. Price

Text Structure Consulting