[Archive copy mirrored from: http://www.sgmlsource.com/8879rev/n1929.htm]

N1929

ISO/IEC JTC1/SC18/WG8

Document Processing and Related Communication -

Document Description and Processing Languages

TITLE: Proposed TC for WebSGML Adaptations for SGML
SOURCE: WG8
PROJECT: JTC1.18.15.1
PROJECT EDITOR: Charles F. Goldfarb
STATUS: WG8 Approved Text
ACTION: For information
SUMMARY OF MAJOR POINTS: This Technical Corrigendum adds a normative annex K and an informative annex L to ISO 8879 to meet an urgent need for adaptations of SGML for use on the World Wide Web and intranets. It incorporates by reference the Extended Naming Rules TC.

This TC does not affect existing SGML documents or products. It affects only those SGML documents and products that choose to support the WebSGML Adaptations option.

DATE: 1 June 1997
DISTRIBUTION: WG8 and Liaisons
REFER TO: ISO 8879
REPLY TO: Dr. James D. Mason
(ISO/IEC JTC1/SC18/WG8 Convenor)
Oak Ridge National Laboratory
Information Management Services
Bldg. 2506, M.S. 6302, P.O. Box 2008
Oak Ridge, TN 37831-6302 U.S.A.
Telephone: +1 423 574-6973
Facsimile: +1 423 574-6983
Network: masonjd@ornl.gov
http://www.ornl.gov/sgml/wg8/wg8home.htm
ftp://ftp.ornl.gov/pub/sgml/wg8/

TC for WebSGML Adaptations

Add the following normative annex K and informative annex L to ISO 8879.

Annex K (normative)
WebSGML Adaptations

This annex describes an optional extension of SGML known as the "WebSGML Adaptations". The extension incorporates by reference the Extended Naming Rules TC, such that a system that supports these WebSGML Adaptations also supports the Extended Naming Rules. An SGML system need not support these WebSGML Adaptations in order to be a conforming SGML system.

To distinguish SGML declarations that use the facilities of this TC from those that do not, the minimum literal in productions [171] and [200] of ISO 8879:1986 (the "version literal") must be modified to read "ISO 8879:1986 (WWW+ENR)". To accomplish this add the following sentence to the paragraph immediately following production [171] and to the second paragraph following production [200]:

However, when WebSGML Adaptations are used, the minimum data must be "ISO 8879:1986 (WWW+ENR)".

An SGML parser that supports this TC shall also be able to parse documents whose version literal indicates that they do not use the facilities of this TC. The parsing of such documents must produce the same grove as would an SGML parser that does not support this TC. Validation of such documents, however, is with respect to the standard as modified by this TC and, by reference, the ENR TC.
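
Note: As an informal illustration, the test on the version literal can be sketched in Python as follows. The function name uses_websgml is hypothetical. The sketch assumes the reference concrete syntax delimiters ("<!" for MDO, LIT or LITA around the literal) and does not handle an SGML declaration reference.

```python
import re

WEBSGML_LITERAL = "ISO 8879:1986 (WWW+ENR)"

def uses_websgml(sgml_declaration: str) -> bool:
    """Report whether the version literal selects the WebSGML Adaptations."""
    # The version literal is taken to be the first quoted literal after "<!SGML".
    match = re.search(r'<!SGML\s+(["\'])(.*?)\1', sgml_declaration, re.DOTALL)
    if match is None:
        raise ValueError("no version literal found")
    # Collapse runs of white space before comparing (assumption of this sketch).
    literal = " ".join(match.group(2).split())
    return literal == WEBSGML_LITERAL
```

A parser built along these lines would select the unmodified rules whenever the function returns False.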

This annex is phrased in terms of revisions to be made to the body of this International Standard. However, these revisions are applicable only when the WebSGML Adaptations are in use.

This annex makes reference to groves and property sets, which are defined in the SGML Extended Facilities of the 2d Edition of HyTime (ISO/IEC 10744) and also in DSSSL (ISO/IEC 10179).

Note: The SGML Extended Facilities are expected to be incorporated in the forthcoming revision of this International Standard.

The WebSGML Adaptations are as follows.

Definitions

The following definitions are used in this Annex.

Definitions related to document type declarations

DTD declarations: Markup declarations that specify the part of a document type definition (DTD) that is expressible in SGML. They occur in the global subset and in the external and internal subsets of document type declarations.

External subset: The portion of a document type declaration subset referenced by the external identifier parameter of a document type declaration.

Internal subset: The portion of a document type declaration subset that occurs between the dso and dsc of a document type declaration.

Note: If there are no DTD declarations within the document type declaration in the SGML document entity, then, whether or not the dso and dsc are omitted, the internal subset is considered empty (rather than non-existent).

Definitions related to validity checking

Conforming SGML document:

An SGML document that complies with all provisions of this International Standard.

Note: The provisions allow for choices in the use of options, features, and variant concrete syntaxes.

A conforming SGML document must be either a type-valid SGML document, a tag-valid SGML document, or both.

Note: A user may wish to enforce additional constraints on a document, such as whether a document instance is integrally-stored or free of entity references.

Type-valid SGML document:

An SGML document in which, for each document instance, there is an associated document type declaration to whose DTD that instance conforms.

Tag-valid SGML document:

An SGML document, all of whose document instances are fully-tagged. There need not be a document type declaration associated with any of the instances.

Note: If there is a document type declaration, the instance can be parsed with or without reference to it.

Fully-tagged document instance:

A document instance in which a start-tag with a generic identifier, and an end-tag, are present for every element, and the attribute name is present in every attribute specification in the start-tag.

Note: Processing without reference to DTD declarations is possible only for a fully-tagged document instance.

Definitions related to entity constraints

An SGML system that supports unconstrained SGML documents must be able to parse DTD declarations and resolve both internal and external entity references. If it continues parsing (as a form of error-recovery) after failing to access a referenced entity, the results will be unpredictable. Observing one or more of the entity constraints defined in this International Standard may cause a document to be more amenable to processing by a simpler SGML system, or in an environment (such as a network) where entity access may be slow or unreliable.

Integrally-stored document instance:

A document instance in which every element and marked section ends in the entity in which it begins.

Note: This constraint makes it possible, as a form of error-recovery, for parsing to continue in a fully-tagged document instance after a failure to access a referenced entity. The resulting grove will be the same for the parsed text, except for the tree addresses of younger siblings of the nodes in the inaccessible entity.

Reference-free document:

An SGML document that has no entity references other than delimiter entity references, although it could have attribute values that refer to entities.

Note: A reference-free document can be parsed by systems that cannot resolve entity-references. If it is tag-valid, it can also be parsed by systems that cannot parse DTD declarations.

External-reference-free document:

An SGML document that has no external entity references, although it could have attribute values that refer to external entities.

Note: External-reference-free documents can be parsed by systems that cannot resolve external entity-references.

Entity-declaration-free document:

An SGML document that contains no entity declarations.

Note: An entity-declaration-free document that contains entity references other than delimiter entity references requires IMPLYDEF ENTITY. If the document is tag-valid, it can be parsed by systems that cannot parse DTD declarations.

Other definitions

Functionally equivalent groves:

Groves that, except for certain permitted differences, are identical with respect to the portion of their grove plans that is included within the SGML default grove plan. The allowable differences apply to constructs that are parsed without respect to declarations. They are:

Note: The SGML default grove plan does not include any markup properties.

Predefined data character entity:

A general entity, associated with a character number in the syntax-reference character set, that is used to reference significant SGML characters as data.

Note: In order to allow delimiter escaping when parsing without reference to DTD declarations, data character entities should be predefined for the first character of each delimiter string that can be recognized in a mode where data can occur.

SGML declaration body:

The parameters of an SGML declaration.

White space:

The characters assigned to the SEPCHAR, SPACE, RE, and RS delimiter roles.

Specific changes

SGML declaration changes

SGML declaration reference

An alternative form of the SGML declaration is permitted, known as an SGML declaration reference. It references an SGML declaration body. SD is added as a new public text class identifying an SGML declaration body.

<!SGML public-identifier>

In order for SGML documents to be self-identifying, it is strongly recommended that all conforming SGML documents contain one of the forms of SGML declaration.

Note: For example:

<!SGML PUBLIC "IDN//W3C.ORG//SD HTML Version 3.2//EN">
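
Note: As an informal illustration, a declaration reference of the form shown in the example could be split into its parts as sketched below in Python. The function name and the returned field names are hypothetical. The sketch assumes the reference concrete syntax delimiters, a double-quoted identifier, and a formal public identifier without a display version.

```python
import re

def parse_declaration_reference(text: str) -> dict:
    """Split an SGML declaration reference into owner, class, description, language."""
    match = re.fullmatch(r'\s*<!SGML\s+PUBLIC\s+"([^"]*)"\s*>\s*', text)
    if match is None:
        raise ValueError("not an SGML declaration reference")
    fields = match.group(1).split("//")
    # Owner identifiers that start with a prefix ("IDN", "+", "-") span two fields.
    owner_fields = 2 if fields[0] in ("IDN", "+", "-") else 1
    if len(fields) < owner_fields + 2:
        raise ValueError("not a formal public identifier")
    text_class, _, description = fields[owner_fields].partition(" ")
    if text_class != "SD":
        raise ValueError("public text class is not SD")
    return {
        "owner": "//".join(fields[:owner_fields]),
        "text_class": text_class,
        "description": description,
        "language": fields[owner_fields + 1],
    }
```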

White space handling

The SGML declaration provides a new feature, "white space in content rule" (WSCON), with the values KEEPALL or SGML1986 (the current rule), which is the default. If KEEPALL is specified, all white space in mixed content is included in the grove as datachar nodes, and all white space known to be in element content is included in the grove as ssep nodes.

Note: If KEEPALL is specified, the RS and RE ignoring rules in 7.6.1 do not apply.

Note: This feature does not affect delimited strings, such as attribute value literals, which have their own rules for normalizing white space (and which, in any case, do not occur in content).

Optional capacity limits

The capacity set parameter of the SGML declaration can be specified as CAPACITY NOLIMITS, indicating that there are no capacity limits.

Optional quantity limits

The quantity set parameter of the SGML declaration can be specified as QUANTITY NOLIMITS, indicating that there are no quantity limits except BSEQLEN, which has the quantity specified for it in the reference quantity set. Specifying QUANTITY NOLIMITS does not require a system to support quantities greater than those specified in its system declaration.

Hex character reference

A new hexadecimal character reference open (HCRO) delimiter is used when a numeric character reference is represented as a hexadecimal string. It is recognized when followed by a hex digit. The hex digit begins a hex digit string of no greater than NAMELEN length, which is terminated in the same way as other numeric character references. A concrete syntax need not assign a string to this delimiter. It is recognized in the same modes as CRO.
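
Note: As an informal illustration, hexadecimal character references could be resolved as sketched below in Python. The delimiter strings "&#x" for HCRO and ";" for REFC, and the NAMELEN value of 8, are assumptions of the sketch (a concrete syntax assigns the actual strings and quantities). The sketch also treats character numbers as Unicode code points and accepts only REFC, or nothing, as the reference end.

```python
import re

def resolve_hex_references(text: str, hcro: str = "&#x", refc: str = ";",
                           namelen: int = 8) -> str:
    """Replace each hex character reference in text with the referenced character."""
    # HCRO is recognized only when followed by a hex digit; REFC is optional here.
    pattern = re.compile(re.escape(hcro) + r"([0-9A-Fa-f]+)(?:" + re.escape(refc) + r")?")

    def replace(match: re.Match) -> str:
        digits = match.group(1)
        if len(digits) > namelen:
            raise ValueError(f"hex digit string longer than NAMELEN ({namelen})")
        return chr(int(digits, 16))

    return pattern.sub(replace, text)
```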

Validity assertions

The "check validity assertions" (VALCHECK) parameter allows a document to assert whether it is type-valid, tag-valid or both. It is a reportable error if an assertion is untrue.

Note: For example, if an otherwise conforming type-valid document incorrectly asserts that it is tag-valid, the document is non-conforming. Had the assertion not been made, the document would have been conforming.

VALCHECK, (TAGVALID?, TYPVALID?)

Entity constraint assertions

The "check entity constraint assertions" (ENTCHECK) parameter allows a document to assert whether it satisfies specified constraints. It is a reportable markup error if an assertion is untrue.

ENTCHECK, (INTEGRAL?, (NOREF | NOEXTREF | NOENTDEC)?)

where:

INTEGRAL The document instance is integrally-stored.

NOREF The document is reference-free.

NOEXTREF The document is external-reference-free.

NOENTDEC The document is entity-declaration-free.

Note: For example, if an otherwise conforming SGML document incorrectly asserts that it is integrally-stored, the document is non-conforming. Had the assertion not been made, the document would have been conforming.
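
Note: As an informal illustration, the ordering and exclusivity expressed by the ENTCHECK production above can be checked as sketched below in Python. The function name is hypothetical, and the sketch examines only the keyword sequence, not whether the assertions are true of a document.

```python
EXCLUSIVE_KEYWORDS = {"NOREF", "NOEXTREF", "NOENTDEC"}

def valid_entcheck_keywords(keywords) -> bool:
    """Check a keyword sequence against (INTEGRAL?, (NOREF | NOEXTREF | NOENTDEC)?)."""
    rest = list(keywords)
    # INTEGRAL, if present, must come first.
    if rest and rest[0] == "INTEGRAL":
        rest = rest[1:]
    # At most one of the mutually exclusive reference constraints may follow.
    return len(rest) == 0 or (len(rest) == 1 and rest[0] in EXCLUSIVE_KEYWORDS)
```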

Application requirements

A new parameter is declarable in the SGML Declaration, the "Application Requirements" parameter:

SEEALSO, s ((public_identifier s)+ | NONE)

The public identifiers identify notations that specify application-specific requirements for the document, including requirements unrelated to the SGML language. These requirements are in addition to, and must not contradict, the requirements of this International Standard. Failure to satisfy application requirements is not a reportable markup error, except to the extent that such requirements are also expressed in other parameters of the SGML declaration.

Note: For example, this parameter could be used by an SGML system to signal the existence of requirements for specific entity constraint assertions, formatting conventions for specified element types, or data restrictions, such as that the number of cells in a table row does not exceed the number specified in some attribute. It is not a reportable markup error if the application's required entity assertions are not present in the SGML declaration, although if they are present, it is a reportable markup error if the document fails to satisfy them.

It is not an error if the system is unable to access the object named by the public identifier.

Note: See Annex L "Application Requirements for XML" for an example.

Empty element start-tags

A new delimiter role NESTC (NET-enabling start-tag close) is defined. If it is not assigned, the NET string is used. It must be used to close a start-tag if NET is to be used as the end-tag.

Note: For example, if NESTC is "/" and NET is ">", an empty "img" element with a null end-tag would look like:

<img/>

SHORTTAG unbundled

If SHORTTAG NO is specified, any of the following can be specified to enable specific short tag minimization features:

(STARTTAG, YES)?

means that empty start-tags can be used.

(ENDTAG, (EMPTYTAG | (NETANY | NETEMPTY) | BOTH), REQUIRED?)?

means that empty, null, or both forms of short end-tag can be used. Specifying NETEMPTY means that NESTC and NET are restricted to elements with empty content (not necessarily declared empty). If REQUIRED is specified, the indicated form(s) must be used for the applicable elements and unminimized end-tags cannot be used.

Note: If empty elements are prohibited from having end-tags, NETEMPTY effectively prohibits the use of NET.

(ATTS, (DEFAULT?, NONAME?, NOQUOTE?))?

means that any or all of the three forms of attribute minimization are permitted:

DEFAULT enables attribute value defaulting.

NONAME allows some attribute names to be omitted.

NOQUOTE allows some attribute values to be specified directly.

Attribute definitions

Syntax summary

The syntax for ATTLIST is revised to provide the functionality of the syntax shown here:

<!ATTLIST "#NOTATION"?
  (name | name group | #IMPLICIT | #ALL )
  attribute definition*
>
where:

"name" is either an element type name or a notation name, depending on whether #NOTATION is specified.

"name group" is one or more names in parentheses.

#IMPLICIT refers to all implicitly defined element types (or notation names). It is the equivalent of a name group.

"#ALL" is all element type names or notation names.

Multiple declarations

The same element type name (or notation name) may be the associated element type (or notation) in multiple ATTLIST declarations. An attempt to redefine an attribute that was previously defined for specified element types or notations is not an error; the earliest definition prevails (just as for entity declarations).

Empty definition list

An ATTLIST declaration can have an empty list of attribute definitions.

Universal definitions

"#ALL" can be specified as the associated element type name (or notation name) in an attribute definition list declaration. If so, the definitions are associated with all element type names (or notation names). Definitions associated with #ALL can be overridden by attribute declarations for specific element types or notations, including definitions specified with #IMPLICIT (all implicitly-defined element types or notations). An attempt to redeclare for all element types (or notations) an attribute that was previously declared for all element types (or notations) is not an error; the earliest declaration prevails (just as for entity declarations).

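Note: As an informal illustration, the rules above for multiple and universal definitions can be sketched in Python as follows. The function name and the tuple layout are hypothetical. Looking up definitions for a specific element type before those for #IMPLICIT is an assumption of the sketch; the text states only that both override #ALL.

```python
def effective_definition(declarations, element, attribute,
                         implicit_elements=frozenset()):
    """Return the attribute definition that applies, or None if there is none.

    declarations is a list of (scope, attribute name, definition) tuples in
    document order, where scope is an element type name, "#IMPLICIT" or "#ALL".
    """
    scopes = [element]
    if element in implicit_elements:
        scopes.append("#IMPLICIT")
    scopes.append("#ALL")  # universal definitions are consulted last
    for scope in scopes:
        for decl_scope, name, definition in declarations:
            if decl_scope == scope and name == attribute:
                return definition  # the earliest definition in a scope prevails
    return None
```
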
Duplicate name tokens

The current restriction on duplicate name token ("enum") values for attributes in the same set of attribute definitions is removed. Minimization by omitting the attribute name is possible only for values that occur for only one attribute in the set of attribute definitions.
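
Note: As an informal illustration, the name tokens for which the attribute name can still be omitted could be computed as sketched below in Python. The function name and the dictionary layout are hypothetical.

```python
from collections import Counter

def omissible_name_tokens(attribute_definitions: dict) -> dict:
    """Map each name token that occurs for exactly one attribute to that attribute.

    attribute_definitions maps an attribute name to its list of name tokens,
    for a single set of attribute definitions.
    """
    counts = Counter(
        token
        for tokens in attribute_definitions.values()
        for token in set(tokens)
    )
    return {
        token: attribute
        for attribute, tokens in attribute_definitions.items()
        for token in tokens
        if counts[token] == 1
    }
```

With the hypothetical definitions align (left, right, center) and valign (top, center, bottom), the token "center" is absent from the result, so it could be specified only together with its attribute name.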

Implicit definitions

Implicit document type name

The keyword "#IMPLIED" is allowed as an alternative to the document type name in a DOCTYPE declaration or the source document type name in a LINKTYPE declaration. When the keyword is used, the document type name is the element type name of the document element. For example:

<!DOCTYPE #IMPLIED SYSTEM "some.dtd"> <docelem>

It is a reportable markup error if #IMPLIED is specified and the document instance does not begin with the start-tag of the document element.

This facility is allowed only when there is a single document instance and this is the only document type declaration.

Note: This facility is therefore incompatible with explicit link and concur.

Implicit document type declaration

When IMPLYDEF DOCTYPE is specified and there is only one document instance and no document type declarations, the document type declaration associated with the instance is assumed to be:

<!DOCTYPE #IMPLIED SYSTEM>

Note: This facility is used to imply the applicable DTD. When parsing without respect to DTD declarations, there is no need to imply an applicable DTD.

Note: This facility is incompatible with explicit link and concur.

Other implicit definitions

When IMPLYDEF is specified with the keywords ELEMENT, ENTITY, %ENTITY, NOTATION, and/or ATTLIST, the corresponding constructs can be used without an explicit declaration. The implied definitions are as follows:

ELEMENT: "- - ANY"

ENTITY, %ENTITY, NOTATION: "SYSTEM"

ATTLIST: "CDATA #IMPLIED" for each attribute definition.

When IMPLYDEF ENTITY is specified, a default entity declaration is not permitted.
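
Note: As an informal illustration, the implied definitions listed above correspond to the declarations generated by the following Python sketch. The function name is hypothetical, and the declarations are written with the reference concrete syntax delimiters.

```python
def implied_declaration(kind: str, name: str, attribute: str = "") -> str:
    """Return the declaration implied for an undeclared construct.

    kind is one of the IMPLYDEF keywords. For ATTLIST, name is the element
    type name and attribute is the undeclared attribute name.
    """
    if kind == "ELEMENT":
        return f"<!ELEMENT {name} - - ANY>"
    if kind == "ENTITY":
        return f"<!ENTITY {name} SYSTEM>"
    if kind == "%ENTITY":
        return f"<!ENTITY % {name} SYSTEM>"
    if kind == "NOTATION":
        return f"<!NOTATION {name} SYSTEM>"
    if kind == "ATTLIST":
        return f"<!ATTLIST {name} {attribute} CDATA #IMPLIED>"
    raise ValueError(f"not an IMPLYDEF keyword: {kind}")
```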

Elements with empty content

Empty element end-tags

The EMPTYETG feature indicates whether elements with empty content have end-tags. The possible values are:

NO They are not permitted. (This is the case without this TC.)

YES They are permitted and are subject to markup minimization in the same way as any other end-tag.

When processing with reference to DTD declarations, it is a reportable markup error if an element that is required to be empty contains any text, including white space, other markup, or included subelements.

Supporting changes

Internet domain names in public identifiers

Internet IP domain names that contain only minimum data can be used as public text owner identifiers. To do so, the formal public identifier must begin with "IDN//domain.name".

Note: Because of different name-spelling rules, not all internet domain names can be used in this way.

Note: When constructing a public text owner identifier, users may wish to consider its potential lifespan and that of the objects to be identified by it.

Element type declaration

The term "element declaration" in the definitions clause 4.111 is changed to "element type declaration".

All other occurrences of "element declaration" in this International Standard are changed to "element type declaration".

Entity set

In production 113, optional attribute definition list declarations and notation declarations are permitted. Formally:

[113] entity set =
      ( entity declaration |
        attribute definition list declaration |
        notation declaration |
        ds )*
where the keyword "#NOTATION" must be specified in the attribute definition list declaration.

SGML declaration in subdocuments

A subdocument can contain an SGML declaration.

Parsing without reference to DTD declarations

End of required declarations

A new EOR ("end of declarations that are required for all purposes") indicator can be placed anywhere among the DTD declarations. Declarations occurring after the first EOR indicator can be ignored during any processing of a fully-tagged document instance other than type validation.

Indicator syntax: mdo, "EOR", mdc

Example: <!EOR>

It is a reportable markup error if the groves produced by parsing with and without respect to the ignorable declarations are not functionally equivalent groves.

A validating parser that fails to parse required declarations must report the possibility that the grove it produces may not be functionally equivalent.

Note: Non-validating parsers should also report this situation.

Note: The following document type declaration indicates that the external subset is not required for all purposes. The same effect could be achieved by an EOR indicator at the start of the external subset, but the form in the example makes it unnecessary to access the external subset.

<!DOCTYPE sometype SYSTEM "some.dtd" [<!EOR>]>

Note: Although the EOR indicator uses the MDO and MDC delimiters, it is not a markup declaration.
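
Note: As an informal illustration, the division of DTD declarations at the first EOR indicator can be sketched in Python as follows. The function name is hypothetical. The sketch assumes the indicator is written "<!EOR>" and does not allow for that string occurring inside a comment or literal.

```python
def split_at_eor(declarations: str, indicator: str = "<!EOR>"):
    """Split DTD declarations into (required, ignorable) at the first EOR indicator."""
    # Only the first indicator is significant; later ones fall in the ignorable part.
    required, _, ignorable = declarations.partition(indicator)
    return required, ignorable
```

A processor that is not performing type validation on a fully-tagged document instance could then parse only the first part of the result.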

Inferred grove properties

When a document is parsed without reference to some DTD declarations, their essential grove properties are inferred from the document instance as follows:

  1. Element types have a contype property of ANY. They occur in the grove in the order in which their elements first occur in the document.
  2. Attribute definitions have a decltype property of CDATA and a dflttype property of #IMPLIED. They occur in the grove in the order in which their attribute specifications first occur in the document. For an impliable attribute, whether or not the declaration for its definition was parsed, no attribute assignment is created unless there is an attribute specification.
  3. The document type name is determined as though it had been specified as #IMPLIED.

    Note: These working definitions for the grove do not imply how element types and attribute definitions are actually declared (if DTD declarations exist) or should be declared (if declarations are to be created later).
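
Note: As an informal illustration, the three inference rules above can be sketched in Python for a fully-tagged document instance. The function and variable names are hypothetical. The sketch assumes reference concrete syntax delimiters and quoted attribute values, does not fold the case of names, and ignores comments, marked sections, and processing instructions.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
ATTR_SPEC = r"(" + NAME + r")\s*=\s*(?:\"[^\"]*\"|'[^']*')"
START_TAG = re.compile(r"<(" + NAME + r")((?:\s+" + ATTR_SPEC + r")*)\s*/?>")

def infer_properties(instance: str):
    """Return (document type name, element types) inferred from start-tags."""
    element_types = {}  # dicts keep insertion order, i.e. order of first occurrence
    doctype_name = None
    for tag in START_TAG.finditer(instance):
        gi = tag.group(1)
        if doctype_name is None:
            doctype_name = gi  # rule 3: as though #IMPLIED had been specified
        element = element_types.setdefault(gi, {"contype": "ANY", "attributes": {}})
        for spec in re.finditer(ATTR_SPEC, tag.group(2)):
            # rule 2: CDATA declared value, #IMPLIED default
            element["attributes"].setdefault(
                spec.group(1), {"decltype": "CDATA", "dflttype": "#IMPLIED"}
            )
    return doctype_name, element_types
```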

Predefined data character entities

When the predefined data character entities feature is used, a general entity name can be associated with a character number in the syntax-reference character set. When referenced, the replacement text is a numeric character reference to the corresponding character. Predefined data character entities are treated as though defined at the start of the internal subset of all documents in which the concrete syntax is used.

Note: For example:

ENTITIES
         "amp"  38
         "lt"   60
         "gt"   62
         "quot" 34
         "apos" 39

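Note: As an informal illustration, the effect of the ENTITIES example above can be sketched in Python as follows. The function names are hypothetical. The sketch assumes the reference concrete syntax delimiters ("&" for ERO, "&#" for CRO, ";" for REFC) and treats character numbers as Unicode code points.

```python
import re

# Entity names and character numbers taken from the ENTITIES example.
PREDEFINED = {"amp": 38, "lt": 60, "gt": 62, "quot": 34, "apos": 39}

def replacement_text(name: str) -> str:
    """Return the replacement text: a numeric character reference."""
    return f"&#{PREDEFINED[name]};"

def expand_predefined(text: str) -> str:
    """Replace references to predefined data character entities with their characters."""
    pattern = re.compile(r"&(" + "|".join(PREDEFINED) + r");")
    return pattern.sub(lambda match: chr(PREDEFINED[match.group(1)]), text)
```
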
SYSTEM declaration

Functionally equivalent groves

The following sub-parameter is added to the validation services parameter of the SYSTEM declaration to indicate whether the parser can validate whether parsing a document with and without respect to declarations occurring after the <!EOR> indicator produces functionally equivalent groves.

FUNGROVE (NO | YES)

Validity checking

The following sub-parameter is added to the validation services parameter of the SYSTEM declaration to indicate whether the parser can validate for type-validity, tag-validity, or both:

VALCHECK (NO | TYPVALID | TAGVALID | BOTH)

Entity constraint checking

A new parameter, "entity constraint checking services" (ENTCHECK) is added to the SYSTEM declaration to indicate the kinds of entity constraint checking that a parser can perform. The keywords for the constraints are:

INTEGRAL The document instance is integrally-stored.

NOREF The document is reference-free.

NOEXTREF The document is external-reference-free.

NOENTDEC The document is entity-declaration-free.

Syntax summary

SGML declaration

The SGML declaration parameters described in this Annex are defined as follows.

System declaration

The system declaration parameters described in this Annex are defined as follows.

Annex L (informative): Application Requirements for XML

These application requirements have the formal public identifier "ISO 8879//NOTATION Application Requirements for XML//EN".

This is version 1.0 of this annex.

Note: Pointers to revised versions of this annex, and other SGML-related information that may change over time, can be found at the International SGML Users' Group web site at URL "www.isug.org". The SGML Users' Group is an international non-profit membership organization, chartered as an educational charity in the United Kingdom, and is a liaison member of the ISO/IEC subcommittee that developed SGML. Nevertheless, the documents that it distributes have not been subject to ISO/IEC review procedures, have no official status, and are not endorsed by ISO or IEC or any of their national member bodies or affiliates.

Application summary

The Extensible Markup Language (XML) is an application profile of SGML, developed for exchanging SGML documents over the World Wide Web. It is defined in URL "http://www.w3.org/WWW/TR/WD-xml-lang-970331.html".

The XML specification covers the following aspects of an SGML application and system:

  1. Restrictions on the use of some SGML language constructs.
  2. A required character set and mandatory encodings.
  3. Techniques by which a system can identify the character-encoding used in a document.
  4. Constraints on the mapping of entities to physical storage objects.
  5. Application semantics of the type appropriate for an enabling architecture (that is, less than a complete DTD and associated semantics, which are customary for applications, such as HTML).
  6. Details of the interface between applications and an SGML/XML parser beyond what is addressed in ISO 8879 and related standards.

XML distinguishes two classes of conforming documents. Its "valid" documents are type-valid conforming SGML documents. When the WebSGML Adaptations are in use, XML "well-formed" documents are tag-valid conforming SGML documents.

A full-SGML validating parser cannot presently validate SGML documents for conformance to XML unless it is specially modified to support XML. That is because some of XML's language restrictions cannot be expressed in the SGML declaration, even when the WebSGML Adaptations are in use.

The remainder of this annex describes only the XML restrictions on the SGML language. The XML specification should be consulted for a description of the other application requirements.

XML restrictions on full SGML

The following list describes language requirements of XML beyond those of full SGML with WebSGML Adaptations support (that is, ISO 8879 as modified by Annexes J and K). The list is believed to be complete as of publication of this annex.

SGML declaration

XML requires a specific SGML declaration. It defines the XML character set, concrete syntax, and the SGML features and options that are supported or expressly prohibited.

SGML declaration reference

Explicit SGML declaration not permitted.

Entity constraints

Document instances must be integrally-stored.

Integral storage does not require a marked section to start and end in the same entity.

In well-formed XML documents, attribute values cannot contain external entity references.

An entity must use a single character encoding throughout.

Features and options

SUBDOC, LINK, CONCUR, and markup minimization are not supported, except for attribute value defaulting.

Does not use short references.

Other considerations

Requires a particular document character set.

Requires a particular concrete syntax.

For elements with empty content, whether or not declared empty, the start-tag must be closed with the NESTC delimiter and the end-tag, which is required, must be a NET delimiter.

Both prolog and document instance

Comments

A comment declaration consists of exactly one delimited comment.

Reference end

Reference end is restricted to REFC. It is required in order to terminate a reference.

Delimiter recognition

A delimiter that occurs in a mode where data can occur (e.g., CON and LIT modes) with a contextual constraint that it be followed by a name start character, a digit, or a hexadecimal digit, is recognized even if this contextual constraint is not satisfied.

Character references

Named character references are not supported.

Document instance

Marked section in document instance

A marked section declaration is restricted to a single status keyword: CDATA. It must immediately follow the first DSO and must immediately be followed by the second DSO. It cannot be specified via a parameter entity reference.

Entity references

Entity references can refer only to SGML text entities (not data or subdocument entities).

Prolog (DTD declarations)

All declarations

Comments are not allowed in parameter separators of DTD declarations.

Parameter entity references permitted only in restricted locations with restricted replacement text.

No public identifiers in external identifiers (affects ENTITY, DOCTYPE, and NOTATION declarations).

Marked section in prolog

A marked section declaration is restricted to a single status keyword, either INCLUDE or IGNORE. It must immediately follow the first DSO and must immediately be followed by the second DSO. It can be specified via a parameter entity reference.

Element type declarations

No name groups for declaring multiple element types.

No CDATA or RCDATA declared content.

No exclusions or inclusions in content models.

No minimization parameters.

Mixed content models must be optional-repeatable OR-groups, with #PCDATA first.

No AND (&) content model groups.

Attribute definition list declarations

No name groups for making a single ATTLIST declaration apply to multiple element types.

No NAME[S], NUMBER[S], or NUTOKEN[S] declared values for attributes.

No #CURRENT or #CONREF default values for attributes.

Attribute default values must be quoted.

No data attributes for NOTATIONs.

Entity declarations

No SDATA, CDATA, or bracketed internal entities.

No SUBDOC, CDATA, or SDATA external entities.

No attribute value specifications on ENTITY declarations (implied by prohibition on data attributes for NOTATIONs).

Language implementation issues

An entity must be stored completely within a single storage object.

A parser must support external entity reference resolution "on demand", except when validating.