[Mirrored from: http://www.ornl.gov/sgml/wg8/document/1855.htm]

WG8 N1855


Document Processing and Related Communication—

Document Description and Processing Languages

TITLE: Third Interim Report on the Project Editor's Review of ISO 8879
PROJECT: JTC1.18.15.1
PROJECT EDITOR: Dr. Charles F. Goldfarb
STATUS: Approved report
ACTION: For information
DATE: 24 May 1996
DISTRIBUTION: WG8 and Liaisons
REPLY TO: Dr. James D. Mason
(ISO/IEC JTC1/SC18/WG8 Convenor)
Oak Ridge National Laboratory
Information Management Services
Bldg. 2506, M.S. 6302, P.O. Box 2008
Oak Ridge, TN 37831-6302 U.S.A.
Telephone: +1 423 574-6973
Facsimile: +1 423 574-6983
Network: masonjd@ornl.gov

WG8 has directed the Project Editor of ISO 8879 (SGML) to conduct a systematic review of the standard to consider future development. As the review is in its final stages and participation is increasing, this report incorporates the substance of both previous reports (N1605 and N1701) and some other documents in order to reduce the number of documents that need be accessed by new participants.

I. Principles and policy

WG8 has agreed to a set of principles for any future development (JTC1/SC18/WG8 N1289, attached below). These principles ensure that all existing conforming SGML documents will continue to conform after any changes are made to the standard.

WG8 has also adopted a policy for the review (JTC1/SC18/WG8 N1350, attached below). The review process ensures that every clause, paragraph, note, and syntax production of ISO 8879 is reviewed.

At the Munich meeting in May 1996, the following additional policy provisions were adopted:

  1. The standard shall be designed so that it is practical for documents to be created with ordinary (i.e., not SGML-aware) text editors. This principle does not justify the addition of excessive markup minimization techniques.
  2. A major objective of SGML is that SGML documents shall be processable by any conforming SGML system. Although a conforming SGML system could store documents in a proprietary format that describes portions of the SGML information set, such a format alone is insufficient for conformance. It is therefore required that conforming SGML systems, if they read pre-existing SGML documents, shall be able to import and, if they write SGML documents, shall be able to export those SGML documents in a "plain text" storage format. A "plain text" storage format is provisionally defined as one that is capable of being read and written by commonly used text editors and file utilities that do not inherently interpret specialized notations such as SGML or proprietary formats of particular products. Some examples of editors whose native storage format is plain text are DOS EDIT and EDLIN, Windows Notepad (but not Write or Wordpad), UNIX EMACS and VI, and Macintosh TeachText and SimpleText.

II. Review activities

The review has been structured in two related activities:

Activity 1. Information described by SGML markup

The objective of this activity is to define explicitly the information described by the SGML syntax and to group it, as appropriate, into useful "information sets", such as the Element Structure Information Set (ESIS) (see annex A of ISO/IEC DIS 13673).

This activity is now complete. The "SGML Property Set" defines the full set of information described by SGML markup. This work was done in conjunction with DSSSL and HyTime experts and appears in both the DSSSL standard and the SGML Extended Facilities (formerly "General Facilities") annex of the HyTime standard. We are proposing that it be incorporated in the revision of ISO 8879.

Activity 2. Proposed changes to ISO 8879

The objective of this activity is to identify changes required to correct or enhance the text of ISO 8879, and to publish a revised edition of the standard that incorporates those changes and the changes made by Amendment 1 (1988).

The activity will be conducted in the following sequence:

  1. Clause-by-clause examination of the standard

    This step is currently in process. It will result in a list of requirements expected to be satisfied by the revision of SGML. Interim reports are being published as the work proceeds. These reports include requirements gathered from comments submitted by participants as well as from the systematic examination of the standard.

  2. Member Body approval of requirements to be satisfied

    The list of requirements generated by the first step will be reviewed for technical accuracy. A final list of "Requirements to be Satisfied by the Revision of ISO 8879", including the expected changes needed to satisfy the requirements, will be submitted for Member Body approval. Also submitted will be the reasons why the list fails to include any requirement that had previously been identified as a requirement expected to be satisfied.

  3. Preparation and balloting of text of changes

    Text will be prepared for the changes needed to satisfy the approved requirements and will be balloted in accordance with ISO directives.

  4. Publication of revised ISO 8879

    Upon approval of the text of the changes, the changes will be incorporated into the text of ISO 8879, together with Amendment 1, and the integrated text will be published.

The review has progressed sufficiently that we can state that changes will be recommended. In order to acquaint the SGML community with the types of change we are contemplating, we have prepared an interim list of requirements and associated changes. Please note that the list is by no means complete with respect either to the set of requirements or the possible changes associated with each requirement. Nor do we believe it to be statistically representative of the changes that we will eventually recommend. The list follows, in no particular order (except for grouping together items from the same earlier report):

(From N1605)

Requirement: To facilitate automatic translation of well-defined document character sets with human inspectability of the translation information (Note: This information is not required for SGML parsing.):

  1. Add the ability to specify when the definition of a character set is represented using the format of the character set description parameter of the SGML declaration.
  2. Add the ability, within a character set description, to identify a base set character by its string name (not just by its character number).
  3. Add the ability to indicate when the string name for a base set character is taken from the ISO 10646 character registry, rather than being the name used in the base set definition itself. (For ISO sets, the two are normally the same.)
  4. Allow the base set of a formal character set definition to be itself a formal character set definition. This allows subsets of large character sets to be used in the definition of other character sets.

Requirement: To facilitate automatic generation of display character entity sets from definitional character entity sets.

For definitional character entity sets, add the ability to define the entity text as a character in a defined character set, by an ISO 10646 character registry name, or both.

Requirement: To facilitate more powerful management of sophisticated versioning requirements.

Allow Boolean combinations of INCLUDE and IGNORE keywords in marked section declarations.
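As a rough illustration of what Boolean combinations could add: under the current standard, when both INCLUDE and IGNORE are effective in one marked section declaration, IGNORE takes priority, which amounts to a logical AND over the keywords. The Python sketch below models that existing behaviour (`all_of`) alongside two hypothetical combinators (`any_of` and `negate`). The function names and the combinator semantics are assumptions made for illustration; this report does not propose a concrete syntax.

```python
INCLUDE = "INCLUDE"
IGNORE = "IGNORE"


def _check(keyword):
    """Reject anything other than the two keywords modelled here."""
    if keyword not in (INCLUDE, IGNORE):
        raise ValueError("unsupported status keyword: %r" % (keyword,))
    return keyword


def all_of(keywords):
    """Existing behaviour: the section is included only if no keyword is IGNORE."""
    result = INCLUDE  # no status keyword at all means INCLUDE
    for keyword in keywords:
        if _check(keyword) == IGNORE:
            result = IGNORE
    return result


def any_of(keywords):
    """Hypothetical OR: the section is included if at least one keyword is INCLUDE."""
    result = IGNORE  # assumption: an empty OR yields IGNORE
    for keyword in keywords:
        if _check(keyword) == INCLUDE:
            result = INCLUDE
    return result


def negate(keyword):
    """Hypothetical NOT: swap INCLUDE and IGNORE."""
    return IGNORE if _check(keyword) == INCLUDE else INCLUDE
```

With such combinators, a versioning rule like "include this text for the draft edition or for any non-confidential edition" could be written as `any_of([draft, negate(confidential)])`, where `draft` and `confidential` stand for the replacement text of parameter entities.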

Requirement: To enhance the usefulness of capacities.

  1. Make specification of any or all capacity limits in a document optional.
  2. Make an SGML system's support for any or all capacity limits optional.
  3. Add capacity limits for document instances, such as number of elements, number of data characters, etc.
  4. Allow optional specification of actual capacities (not just capacity limits).

Requirement: To provide more flexibility for attribute design.

  1. Allow a given name token to appear in the declared value of more than one attribute in the same attribute definition list.
  2. Allow multiple attribute definition list declarations for a single element type.

Requirement: To simplify the specification of SGML declarations.

Allow reference to an existing SGML declaration with local modifications.

Requirement: To allow more choices among optional SGML features.

Modularize the SHORTTAG feature so that attribute minimization can be used with or without allowing empty tags.

Requirement: To clarify difficult portions of the text of the standard.

Clarify the description of record boundary handling, explaining clearly the relationship between detecting data characters and ignoring characters, including whether ignored characters are first recognized as data characters.

(From N1701)

Requirement: To support multi-byte character sets with greater convenience.

Devise a less burdensome method for declaring long sequences of character numbers.

Requirement: To facilitate name-space modularization in DTDs and LPDs.

[The report did not include possible changes that address this requirement.]

(From this meeting)

Requirement: To support the requirements identified by an earlier review.

The recommendations in WG8 N1035 are considered to be incorporated into this list.

Requirement: To support multi-byte character sets with greater convenience.

  1. Provide a way to declare name characters that have no upper/lower case distinction without having to repeat the characters.
  2. Provide a uniform character range specification mechanism and use it uniformly (e.g., in the UC/LCNMSTRT and SHORTREF specifications) to add large chunks of characters concisely.
[WG8 N1854 addresses this requirement further.]
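To give a sense of the saving a range mechanism could offer, the following Python sketch collapses an explicit list of character numbers into inclusive ranges. The function name `to_ranges` and the tuple representation are hypothetical; this report does not define a range notation, and the mechanism eventually adopted could look quite different.

```python
from typing import Iterable, List, Tuple


def to_ranges(numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse character numbers into sorted, inclusive (first, last) ranges."""
    ranges: List[Tuple[int, int]] = []
    for number in sorted(set(numbers)):
        if ranges and number == ranges[-1][1] + 1:
            # Extends the most recent range by one character number.
            ranges[-1] = (ranges[-1][0], number)
        else:
            # Starts a new range of length one.
            ranges.append((number, number))
    return ranges
```

A run of several thousand consecutive character numbers, which today would have to be enumerated individually, reduces to a single pair; isolated numbers remain as ranges of length one.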

Requirement: To support multi-lingual text.

Provide a means for specifying more than one language in the public text language of a formal public identifier.

Requirement: To enable SGML documents to be interchanged in a plain text format that is accessible by commonly used text editors and file utilities that are not inherently capable of interpreting SGML.

Revise the conformance clause to require that conforming SGML systems be able to use a plain text format in addition to any other format that they use.

Requirement: To integrate document architectures, formal system identifiers, and property sets and groves into SGML.

Incorporate the SGML Extended Facilities of the revised ISO/IEC 10744 into ISO 8879.

Requirement: To correct editorial errors.

  1. In clause 5.1, state that Figures 1 to 4 are based on the reference concrete syntax and that the character numbers used are those of the syntax-reference character set.
  2. Remove references to ISO 2022.

Requirement: To support simple external references in SGML.

It shall be possible, in an IDREF or IDREFS attribute value, to specify the name of the entity that defines the name space in which a referenced ID occurs. The syntax is "#ENTITY entity-name ID" (or possibly a list of IDs after which a suitable reserved word, such as #THISDOC, could be used to identify additional local IDs). This facility shall not be extended any further. The standard will point out that HyTime should be used for more sophisticated requirements such as indirect referencing. The standard will also warn the user of the risks of using direct referencing to external objects whose location could change. (Note: The experience of one large user has been that any object referenced more than twice should be referenced indirectly.)
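The attribute value syntax described above can be sketched as a small Python parser. This is only one reading of a provisional proposal: it assumes tokens are separated by white space, that `#ENTITY entity-name` applies to every following ID until `#THISDOC` switches back to the local name space, and that reserved names are matched without regard to case. The function name `parse_idrefs` and the `(entity, id)` result shape are hypothetical.

```python
from typing import List, Optional, Tuple

ENTITY_KEYWORD = "#ENTITY"
LOCAL_KEYWORD = "#THISDOC"  # the report names this only as a possible reserved word


def parse_idrefs(value: str) -> List[Tuple[Optional[str], str]]:
    """Return (entity name or None for the local document, ID) pairs."""
    refs: List[Tuple[Optional[str], str]] = []
    entity: Optional[str] = None  # None means the local name space
    tokens = value.split()
    i = 0
    while i < len(tokens):
        token = tokens[i]
        keyword = token.upper()
        if keyword == ENTITY_KEYWORD:
            # The next token must be the name of the entity that defines the name space.
            if i + 1 >= len(tokens) or tokens[i + 1].startswith("#"):
                raise ValueError("#ENTITY must be followed by an entity name")
            entity = tokens[i + 1]
            i += 2
            continue
        if keyword == LOCAL_KEYWORD:
            entity = None
        elif token.startswith("#"):
            raise ValueError("unknown reserved name: " + token)
        else:
            # IDs are kept as written; name case folding is not modelled here.
            refs.append((entity, token))
        i += 1
    return refs
```

For example, `parse_idrefs("#ENTITY manual sec1 sec2 #THISDOC fig3")` yields two references into the entity `manual` followed by one local reference, while a value containing no reserved names behaves as an ordinary IDREFS list.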

Requirement: To allow any SGML document to be a subdocument.

SGML declarations will optionally be allowed in SUBDOC entities.

Requirement: To extend the character model of SGML to include entity and storage management.

The standard will be based on the character model that is summarized below. Terms used that are not defined here are defined in ISO 8879 or in the Formal System Identifier Definition Requirements (FSIDR) of the SGML Extended Facilities. Some terms defined in those standards are introduced here with informal definitions that are intended to be consistent with the formal ones.

NOTE: This text is subject to revision as the Review proceeds.
  1. An SGML document is made up of entities, which are sequences of characters. Character sequences are passed to the SGML parser by the entity manager. As characters are abstract semantic constructs, and as computers can only deal with representations of abstractions, what actually passes to the parser is a non-standardized internal representation of the characters, known as the "system representation of characters" (SysRC).
  2. The SGML declaration that governs a document defines a character set known as the "document character set" (DCS). A character set is the mapping of a character repertoire onto a "code set", which is an ordered set of consecutive bit combinations of equal size. A bit combination is an ordered collection of bits, interpretable as a binary number. The size of a bit combination in a code set (the "code set width") is the smallest number that is both a power of 2 and an integral multiple of 8, and is such that 2, raised to that number, is at least 1 greater than the largest bit combination in the code set.
  3. The DCS maps both significant characters and purely data characters to bit combinations in the code set. The base-10 integer equivalent of the bit combination mapped to a character is the character's "character number", which can be used in numeric character references in the prolog and document instances. The DCS identifies the remaining bit combinations of the code set as being mapped to "non-SGML characters"; they have character numbers. For each bit combination in the code set, even those that are non-SGML, there is a corresponding SysRC.
  4. An entity is a virtual storage object. The entity manager actually maps an entity onto one or more real storage objects (or portions thereof). A storage object consists of "octet sequences" (or other formats) representing the stored characters, known as the "storage representation of characters" (StoRC). A StoRC is invalid if it does not represent an SGML character.
  5. In the course of accessing storage objects, the entity manager, in conjunction with the storage manager(s), could invoke processes such as conversion, decompression, or decryption whose inverses might have been applied when the StoRC was stored, in addition to mapping from the StoRC to the character semantics (and therefore to the SysRC). Such processes and mappings could either be intrinsic to the storage manager or specified as an attribute in the storage object specification in the FSI.
  6. The StoRC is mapped to the SysRC in one of two ways: either with or without the use of the document character set.
    1. If the storage object is DCS-dependent, a "bit combination transformation format" (BCTF) is applied to the StoRC to produce bit combinations, which are in turn converted to characters (and therefore to the SysRC) using the document character set.
    2. Without the DCS, the storage object is associated with a mapping from the StoRC to characters (and therefore to the SysRC). This mapping, which could imply transformation of an octet sequence, is known as a "character repertoire encoding" (encoding).
  7. DCS-dependent conversions are accomplished by the parser giving the entity manager the mapping from the bit combinations to the SysRC. The entity manager in turn passes that mapping to the storage manager, which uses it to get from the StoRC to the SysRC.
  8. Whether or not the DCS is used for mapping the StoRC to the SysRC, it always exists. Therefore, there is always a mapping from the DCS to the SysRC. That mapping is used to interpret numeric character references in the prolog and document instances, and also to determine whether a StoRC is invalid because it does not represent an SGML character.
  9. The SGML declaration also defines a character set known as the "syntax-reference character set". Its purpose is to allow characters used in the defined concrete syntax to be referenced within the SGML declaration by character numbers that are not dependent on the DCS. It does not play a role in the mapping from the StoRC to the SysRC.
  10. A "char" in a grove is an abstract data type representing a character (defined above as an abstract semantic construct). The concrete representation of a grove must allow any char to be distinguished from all other chars, and for the semantics of each char to be determined. The concrete representation may or may not rely on the DCS, UNICODE, and/or other character sets and repertoires.
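The code set width rule in item 2 and the two StoRC-to-SysRC paths in item 6 can be sketched in Python as follows. Everything here is illustrative: a Python `str` stands in for the SysRC, the DCS is modelled as a dictionary from character numbers to characters (absent numbers are treated as non-SGML), and `fixed_width_bctf` is a hypothetical BCTF that stores each bit combination big-endian in a fixed number of octets. None of these names come from ISO 8879 or the SGML Extended Facilities.

```python
from typing import Callable, Dict, List


def code_set_width(largest_bit_combination: int) -> int:
    """Smallest width that is a power of 2, a multiple of 8, and large enough."""
    if largest_bit_combination < 0:
        raise ValueError("bit combinations are non-negative")
    width = 8
    while 2 ** width < largest_bit_combination + 1:
        width *= 2  # 8, 16, 32, ...: always a power of 2 and a multiple of 8
    return width


def fixed_width_bctf(width_bits: int) -> Callable[[bytes], List[int]]:
    """Hypothetical BCTF: one big-endian bit combination per width_bits/8 octets."""
    step = width_bits // 8

    def apply(storc: bytes) -> List[int]:
        if len(storc) % step:
            raise ValueError("octet sequence is not a whole number of bit combinations")
        return [int.from_bytes(storc[i:i + step], "big")
                for i in range(0, len(storc), step)]

    return apply


def dcs_dependent(storc: bytes,
                  bctf: Callable[[bytes], List[int]],
                  dcs: Dict[int, str]) -> str:
    """Item 6.1: StoRC -> bit combinations (via the BCTF) -> characters (via the DCS)."""
    chars = []
    for number in bctf(storc):
        if number not in dcs:
            # Item 8: a StoRC is invalid if it does not represent an SGML character.
            raise ValueError("character number %d is a non-SGML character" % number)
        chars.append(dcs[number])
    return "".join(chars)


def dcs_independent(storc: bytes, encoding: str) -> str:
    """Item 6.2: an encoding maps the StoRC to characters without using the DCS."""
    return storc.decode(encoding)


# A toy DCS: printable ASCII plus tab, line feed, and carriage return.
TOY_DCS = {n: chr(n) for n in list(range(32, 127)) + [9, 10, 13]}
```

For instance, `code_set_width(255)` is 8 while `code_set_width(256)` is 16, and both `dcs_dependent(b"\x00S\x00G", fixed_width_bctf(16), TOY_DCS)` and `dcs_independent(b"SG", "utf-8")` arrive at the same two characters by different routes.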



TITLE: Future development of ISO 8879
REQUESTED ACTION: For information
DATE: 7 May 1990
DISTRIBUTION: WG8 and Liaisons

The future development of ISO 8879 shall be consistent with the following principles:

  1. Any document that is a conforming SGML document according to the current standard shall continue to be a conforming document under the provisions of future versions of the standard.
  2. The results of parsing an SGML document (that is, the element structure information set, or "ESIS") that conforms to the current standard shall be unchanged when the document is parsed under the provisions of future versions of the standard.
  3. A document that is classified as a minimal or basic conforming SGML document under the current standard shall continue to be classified as such under the provisions of future versions of the standard.
    NOTE 1 -- These principles should not be construed to mean that no changes can be made to ISO 8879. To meet evolving user requirements, for example, some changes of the following types are possible without violating the above principles:
    1. Relaxing restrictions
    2. Adding new constructs
    3. Partitioning existing optional features
    4. Introducing options to allow the suppression of troublesome existing constructs, when experience indicates that the constructs tend to induce user errors with serious consequences
  4. Future versions of the standard shall require conforming SGML parsers and systems to support conforming SGML documents, minimal conforming SGML documents, and basic conforming SGML documents to at least the same extent as the current standard.
    NOTE 2 -- Future versions of this standard can introduce additional requirements as well.
NOTE 3 -- These principles should not be construed to mean that the definition of a "conforming SGML document" cannot be changed, only that existing conforming SGML documents will continue to be classified as such.



TITLE: Policy for the Review of ISO 8879
REQUESTED ACTION: For information
DATE: 11 Oct 1991
DISTRIBUTION: WG8 and liaisons

WG8 has directed the Project Editor of ISO 8879 to conduct a detailed review of the standard to consider future development. The review process will ensure that every clause, paragraph, note, and syntax production of ISO 8879 is reviewed.

The review is not expected to result in any substantive change to the scope of ISO 8879. All proposed changes will adhere to the principles expressed in WG8 N1289, "Future development of ISO 8879."

NOTE -- WG8 N1289 mandates upward compatibility such that conforming SGML documents and applications will remain conforming regardless of changes to ISO 8879. However, it does not necessarily protect existing conforming systems: SGML parsers, for example, may need to be modified to recognize new constructs (that is, to recognize documents that do not conform to the current standard but may conform to a future version). Nor does N1289 protect nonconforming documents: text that is currently erroneous might be considered valid by a future version of ISO 8879.

If the review results in function being added to a future version of ISO 8879, there shall be a revised SGML declaration that will allow identification of the ISO 8879 version to which a document conforms.

A document conforming to the 1986 version (with a 1986 SGML declaration) shall continue to conform in any future system, and to be interpreted in exactly the same way.

NOTE -- This rule applies to all of SGML, including the element structure, entity structure, and nesting of marked sections.

It shall be possible to create an equivalent future SGML declaration for every 1986 SGML declaration. The prolog and document instance set of a 1986 conforming SGML document will be interpreted identically by any future SGML system with either of the equivalent SGML declarations.

As SGML end users and their managers can now learn about SGML in a less formal way than by studying the standard, any proposed changes to the standard resulting from the review process will be written for an audience consisting primarily of software developers, software testers, and members of standards committees.

The review process will be a single-stage process. Any future development of the standard will be done in a coherent manner, not piecemeal. The complete design from the top down will be understood before development of details is prioritized. Technical work and possible reorganization will be completed before final wording of individual paragraphs is attempted.

If experts disagree for good reason over the interpretation of some provision of the standard, the provision shall be considered ambiguous and resolution of the ambiguity shall be considered a corrigendum, rather than added function.

To protect users and implementers from having to make multiple revisions to remain current with the standard, intermediate publication of corrigenda will be considered only in the unlikely event that serious problems with the current version are encountered. However, any changes in a revised version of ISO 8879 that are in fact corrigenda will be identified as such, as they could affect the determination of whether upward compatibility has been maintained.

The user requirements for SGML as presented by each participating expert will be given equal respect, even if other experts have not encountered some requirements in their own work. SGML must accommodate all the requirements.

It is expected that not all SWG meetings will be equally well attended. Complete records will be kept of meetings to keep absent members informed, and to assure consistency of direction from meeting to meeting. In particular, major issues that were resolved at a large meeting will not be reopened at a subsequent smaller meeting in which advocates of one side or the other are not present.
