SGML: ESIS

Last modified April 17, 1997. Robin Cover.

This document is part of the SGML Web Page. Support for development and maintenance of the SGML Web Page is provided in part by SoftQuad, Inc. and by the Summer Institute of Linguistics, to whom gratitude is acknowledged.



ESIS - ISO 8879 Element Structure Information Set

This document was created from two separate files in the WG8 archive in order to present online access to the description of ESIS (ISO 8879 Element Structure Information Set). From: 0931.doc and 0931esis.doc. See also an online version of the ESIS description at Charles Goldfarb's WWW site, as an attachment to N1035, which (for reasons unknown to me) may be more authoritative than this document; [its mirror copy]. For purposes of clarity, the attachment describing ESIS is presented here first - before the N931 document on ISO 8879 revisions.


The ISO 8879 Element Structure Information Set (ESIS)

ISO/IEC JTC1/SC18/WG8 N931 (ESIS)

Title:

The ISO 8879 Element Structure Information Set (ESIS)

Attachment 1 to ISO/IEC JTC1/SC18/WG8 N931, Recommendations for a possible revision of ISO 8879

Source:

Charles F. Goldfarb

Project:

1.18.15.1

Project editor:

Charles F. Goldfarb

Status of document:

Approved WG8 document

Requested action:

For information

Date:

89-08-30

Distribution:

WG8 and liaisons

ISO/IEC JTC1/SC 18/WG 8/N931

Attachment 1 (ESIS)

The ISO 8879 Element Structure Information Set (ESIS)

There are two kinds of SGML application (and therefore two kinds of conforming SGML application):

1. A structure-controlled SGML application operates only on the element structure that is described by SGML markup, never on the markup itself.

2. A markup-sensitive SGML application can act on the actual SGML markup and can act on element structure information as well. Examples include SGML-sensitive editors and markup validators.

The set of information that is acted upon by implementations of structure-controlled applications is called the "element structure information set" (ESIS). ESIS is implicit in ISO 8879, but is not defined there explicitly. The purpose of this paper is to provide that explicit definition.

ESIS is particularly significant for SGML conformance testing because two SGML documents are equivalent documents if, when they are parsed with respect to identical DTDs and LPDs, their ESIS is identical. All structure-controlled applications must therefore produce identical results for all equivalent SGML documents. In contrast, not all markup-sensitive applications will produce identical results from equivalent documents (for example, a program that prints comment declarations, or one that counts the number of omitted end-tags).
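As a rough illustration of the equivalence test, the sketch below compares two hand-written ESIS event streams. The tuple-based event representation and the name `esis_equivalent` are assumptions made for this example only; ISO 8879 does not prescribe any format for ESIS, and no parser is involved here.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical event shapes (not defined by ISO 8879):
#   ("start", gi, attributes)   ("end", gi)   ("data", text)
Event = Tuple[Union[str, Dict[str, str]], ...]


def esis_equivalent(a: List[Event], b: List[Event]) -> bool:
    """Two parsed documents are equivalent when their ESIS is identical."""
    return a == b


# ESIS for a paragraph whose end-tag is present in the markup.
explicit_end_tag = [
    ("start", "P", {"ID": "X"}),
    ("data", "Hello"),
    ("end", "P"),
]

# ESIS for the same paragraph with the end-tag omitted in the markup.
# The end of the element is still reported, so the stream is unchanged.
omitted_end_tag = [
    ("start", "P", {"ID": "X"}),
    ("data", "Hello"),
    ("end", "P"),
]

# A document with different data is not equivalent.
different_data = [
    ("start", "P", {"ID": "X"}),
    ("data", "Goodbye"),
    ("end", "P"),
]

print(esis_equivalent(explicit_end_tag, omitted_end_tag))
print(esis_equivalent(explicit_end_tag, different_data))
```

A markup-sensitive program, such as one counting omitted end-tags, would distinguish the first two documents even though a structure-controlled application cannot.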

ESIS information is exchanged between an SGML parser and the rest of an SGML system that implements a structure-controlled application. Although an implementation may choose to "wire in" some of ESIS, such as the names of attributes, a structure-controlled application need have no other knowledge of the prolog than what ESIS provides.

A system implementing a structure-controlled application is required to act only on ESIS information and on the APPINFO parameter of the SGML declaration.

NOTE -- This requirement does not prohibit a parser from providing the same interface to both structure-controlled and markup-sensitive applications, which could include non-ESIS information (e.g., the date), and/or information that could be derived from ESIS information (e.g., the list of open elements).

NOTE -- The documentation of a conforming SGML system that supports user-developed structure-controlled applications should make application developers aware of this requirement. Such a system should facilitate conformance to this requirement by distinguishing ESIS information from non-ESIS in its interface to applications. Note 1 in 15.3.5 of ISO 8879 applies only to structure-controlled applications.

In the following description of ESIS, information is identified as being available at a particular point in the parsed document. This identification should not be interpreted as a requirement that the information actually be exchanged at that point -- all or part of it could have been exchanged at some other point. Similarly, there is no constraint on the manner (e.g., number of function calls) or format in which the exchanges take place.

The ESIS description includes the information associated with all of the SGML optional features. When a given feature is not in use, corresponding information is not present in the document. ESIS information is transmitted from the parser to the application unless otherwise indicated.

ESIS information applies to a single parsed document instance. Therefore, if concurrent instances are being parsed, the applicable document type name must be identified. This requirement also applies when parsing intermediate instances in a chain of active links.

ESIS information consists of the identification of the following occurrences, and the passing of the indicated information for each:

1. Initialization

The application must inform the SGML parser of the active document types, the active link types, or that parsing is to occur only with respect to the base document type.

2. Start of document instance set

For each active LPD, the link type name and link set information (see 12 below) for the initial link set.

3. Start of document element only

For each active simple link, the link type name and attribute information (see 9 below) for the link attributes.

4. Start of any element

Generic identifier

Attribute information for the start-tag.

For each link rule for which this element is an associated element type, attribute information for the link attributes.

The application must inform the SGML parser which applicable link rule it chose.

For the chosen link rule, the result GI and attribute information for the result element.

If the element has an associated link set, the link set information.

5. End of any element, including elements declared to be empty

Generic identifier

NOTE -- If the element was empty, ESIS does not indicate why it was empty; that is, whether it was declared to be empty, or whether an explicit content reference occurred, or whether it just happened to contain no data characters.

6. End of document instance set

NOTE -- Processing instructions could occur between the end of the document element and the end of the document instance set.

7. Processing instruction

System data

8. Data

Includes no ignored characters (e.g., record starts).

Includes only significant record ends, with no indication of how significance was determined. Characters entered via character references are not distinguished in any way. Implementation-specific means can be used to represent bit combinations that the application cannot accept directly.

NOTE -- Such bit combinations may be those of non-SGML characters entered via character references, but no significance is attached to this coincidence.

NOTE -- Bit combinations of non-SGML characters that occurred directly in the source text would have been flagged as errors, and would therefore never be treated as data.

9. Attribute information

All attribute values must be reported and associated with their attribute names.

NOTE -- For example, a parser could supply the attribute names with each value, or supply the values in an order that corresponds to a previously-supplied list of names.

NOTE -- The order of the tokens in a tokenized attribute value shall be preserved as originally specified.

Each unspecified impliable attribute must be identified.

NOTE -- For example, a parser could identify such attributes explicitly, or it could allow the application to determine them by comparing the identified specified attribute values to a previously-supplied list of attribute names.

There shall be no indication of whether an attribute value was the default value.

The order in which attributes are specified in the attribute specification list is not part of the ESIS.

General entity name attribute values include the entity name and entity text. The entities themselves are not treated as having been referenced.

NOTE -- An application can use system services to parse the entities, but such parsing is outside the context of the current document.

For notation attributes, the attribute value includes the notation name and notation identifier.

For CDATA attributes, references to SDATA entities in attribute value literals are resolved. The replacement text is distinguished from the surrounding text and identified as an individual SDATA entity.

For CDATA attributes, references to CDATA entities in attribute value literals are resolved. The replacement text is not distinguished from the surrounding text.

10. References to internal entities

The information passed to the application depends on the entity type:

SDATA: replacement text, identified as an individual SDATA entity.

PI: replacement text, identified as a processing instruction but not as an entity.

For other references, nothing is passed to the application.

NOTE -- The replacement text is parsed in the context in which the reference occurred, which can result in other ESIS information being passed.

11. References to external entities

The information passed to the application depends on the entity type:

For data entities, the entity name and entity text are passed. If a notation is named, the notation name, notation identifier, and attribute information for the data attributes are also passed.

For SGML text entities, nothing is passed to the application.

NOTE -- The replacement text is parsed in the context in which the reference occurred, which can result in other ESIS information being passed.

For SUBDOC entities, the entity name and entity text are passed. The application can require that the subdocument entity be parsed at the point at which the reference occurred.

NOTE -- Parsing of the subdocument entity can result in other ESIS information being passed. The occurrence of the end of the document instance set of the subdocument entity will indicate that subsequent ESIS information applies to the element from which the subdocument entity was referenced.

12. Link set information

All link rules whose source element specification is implied.
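The occurrences listed above could be delivered to an application in many ways. The following is a minimal sketch of one possibility, covering only element start, element end, data, and attribute information. The class names `Attribute`, `Element`, and `TreeBuilder` and their methods are hypothetical; ESIS places no constraint on the manner or format of the exchange.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Attribute:
    """One attribute as reported under item 9 (hypothetical layout)."""
    name: str
    # Token order within a value is preserved, so a tuple is used.
    # None marks an unspecified impliable attribute.
    tokens: Optional[Tuple[str, ...]]
    # There is deliberately no "was defaulted" flag: ESIS gives no
    # indication of whether a value was the default value.


@dataclass
class Element:
    gi: str
    # Keyed by name: the order of the attribute specification list
    # is not part of ESIS, so it is not retained.
    attributes: Dict[str, Attribute]
    children: List[Union["Element", str]] = field(default_factory=list)


class TreeBuilder:
    """A structure-controlled application that builds a tree from events."""

    def __init__(self) -> None:
        self.root: Optional[Element] = None
        self._open: List[Element] = []

    def start_element(self, gi: str, attributes: List[Attribute]) -> None:
        element = Element(gi, {a.name: a for a in attributes})
        if self._open:
            self._open[-1].children.append(element)
        else:
            self.root = element
        self._open.append(element)

    def data(self, text: str) -> None:
        self._open[-1].children.append(text)

    def end_element(self, gi: str) -> None:
        # Reported for every element, including those declared empty.
        closed = self._open.pop()
        if closed.gi != gi:
            raise ValueError(f"end of {gi!r} reported while {closed.gi!r} is open")


# Events a parser might report for a small document (hand-written here).
builder = TreeBuilder()
builder.start_element("DOC", [Attribute("STATUS", None)])
builder.start_element("P", [Attribute("CLASS", ("NOTE", "WIDE"))])
builder.data("Hello")
builder.end_element("P")
builder.end_element("DOC")
```

Because the builder sees only these events, it cannot tell whether the end of `P` came from an explicit end-tag or from markup minimization, which is the intended property of a structure-controlled application.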


N931 - Recommendations for a possible revision of ISO 8879

ISO/IEC JTC1/SC18/WG8 N931

Title:

Recommendations for a possible revision of ISO 8879

Source:

Charles F. Goldfarb

Project:

1.18.15.1

Project editor:

Charles F. Goldfarb

Status of document:

Approved WG8 document

Requested action:

For information

Date:

89-08-30

Distribution:

WG8 and liaisons

Introduction

ISO/IEC standards are reviewed at least once every five years to determine whether they are still applicable or whether they should be withdrawn. Such reviews frequently result in the publication of a revised edition of the standard.

ISO 8879 was published October 15, 1986. It is the expectation of its developers (ISO/IEC JTC1/SC18/WG8) that a review will result in republication with editorial changes and possibly some new technical enhancements. The purpose of this document is to record those changes that have been agreed to by the developers.

NOTE--This document should be read carefully and taken at face value. In particular, it cannot be stated with certainty that a revision of ISO 8879 will ever be published, or that, if one is published, any of these "accepted" items will find their way unmodified into the final draft.

Items are listed in order by clause number. General comments precede those relating to specific clauses. Each item is preceded by a two-letter code indicating the status of the item and the type of change involved. If the source of the item is a WG8 document, the document number and item number within that document, if any, are given in parentheses. (The attachments to WG8 N680 are N680A, N680B, and N680C.)

The status codes are:

A

Accepted as editing instructions for first draft of revision.

F

Accepted for further study in preparation of revision.

The types of change code are:

E

Editorial: correction of typographical errors, restatement of unclear text, and changes made for consistency or to facilitate maintenance of the document.

R

Resolution of ambiguity (conflict within text of ISO 8879)

T

Technical: innovation, or change to existing function

Items coded "E" and "R" reflect the developers' understanding of SGML as defined by the existing text of ISO 8879. Items coded "T" represent modifications to the SGML language that will not come into effect unless and until a revision of ISO 8879 is published.

General Editorial

AE

Delete some annexes and move them to technical report on Techniques for using SGML (ISO/IEC TR 9573) under indicated topics:

Annex B: Tutorial on basic SGML concepts

Annex C: Tutorial on additional SGML concepts

Annex D.3: Variant concrete syntaxes, including multicode concrete syntaxes

Annex D (except D.3): Public entity sets

Annex E.1: Example of document type definition

Annex E.2: Computer graphics metafile

Annex E.3: Device-independent techniques for code extension

AE

(N680B 81)

Change public identifiers on any revised public text.

AE

All references to ISO standards should change "-" before year to ":". For example, "ISO 8879-1986" should be "ISO 8879:1986".
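The substitution described in this item could be applied mechanically. The sketch below is one way to do so with a regular expression; the function name `modernize_iso_references` is hypothetical, and the pattern assumes a reference of the form "ISO", a standard number, a hyphen, and a four-digit year. Multi-part designations such as "ISO 8859-1-1987" are left untouched by this pattern.

```python
import re

# "ISO" (optionally "ISO/IEC"), whitespace, a number, a hyphen, a 4-digit year.
_ISO_REF = re.compile(r"\b(ISO(?:/IEC)?\s+\d+)-(\d{4})\b")


def modernize_iso_references(text: str) -> str:
    """Replace the hyphen before the year with a colon."""
    return _ISO_REF.sub(r"\1:\2", text)


print(modernize_iso_references("as defined in ISO 8879-1986 and ISO 646-1983."))
```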

AE

Avoid "instance of an element". It should be "element", or "instance of an element type" when emphasis on the type is desired.

AE

(N680B 38a)

13.4.1 onward, keywords in "where" lists are in medium font while in earlier lists they are in bold.

AE

(N924)

Examples of multiple-byte codes in ISO 8879 (none at present) or in technical reports should be modified to follow the recommendation in WG8 N924.

AE

Clauses should be further subdivided and renumbered to isolate individual requirements as much as possible, in order to facilitate correlation of test cases with the standard.

AE

(N680B 3)

Clarify that SHORTREF is semantically a named feature, but syntactically is not.

AE

(N680B 2)

Rationalize use of italicized phrases in body of standard and annexes.

General Technical

AT

Create an ASN.1 description of SGML for binary encoding (SGML-B) as a normative annex. SGML-B should not require delimiter recognition and should not employ markup minimization. However, it should be capable of preserving information about markup minimization, comments, etc., so that transformations in either direction between SGML and SGML-B can be made without loss of information.

Clause 4: Definitions

The full text of the revised definitions is given, rather than change instructions. Although this approach adds to the size of this report, it makes it easier to see the effect of the changes. All definitions are coded "AE". (N680B 20 23-24 26-28, N680C 1 7 11)

4.16 attribute (specification) list: Markup that is a set of one or more attribute specifications.

NOTE--Attribute specification lists occur in start-tags, entity declarations, and link sets.

4.24 bit combination: An ordered collection of bits, interpretable as a binary number.

NOTE--A bit combination should not be confused with a "byte", which is a name given to a particular size of bit string, typically seven or eight bits. A single bit combination could contain several bytes.

4.36 character number: A number that represents the base-10 integer equivalent of the coded representation of a character.

4.38 character repertoire: A set of characters that are used together. Meanings are defined for each character, and can also be defined for control sequences of multiple characters.

NOTE--When characters occur in a control sequence, the meaning of the sequence supersedes the meanings of the individual characters.

4.39 character set: A mapping of a character repertoire onto a code set such that each character in the repertoire is represented by a bit combination in the code set.

4.42 code extension: Techniques for including in documents the coded representations of characters that are not in the document character set.

NOTE--When multiple national languages occur in a document, graphic repertoire code extension may be useful.

4.43 code set: A set of bit combinations of equal size, ordered by their numeric values, which must be consecutive.

NOTES--

1. For example, a code set whose bit combinations have 8 bits (an "8-bit code") could consist of as many as 256 bit combinations, ranging in value from 00000000 through 11111111 (0 through 255 in the decimal number base), or it could consist of any contiguous subset of those bit combinations.

2. A compressed form of a bit combination, in which redundant bits are omitted without ambiguity, is considered to be the same size as the uncompressed form. Such compression is possible when a character set does not use all available bit combinations, as is common when the bit combinations contain several bytes.

4.44 code set position: The location of a bit combination in a code set; it corresponds to the numeric value of the bit combination.
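The two definitions above can be made concrete with a short sketch. It treats bit combinations as integers and checks the "equal size, consecutive values" condition for a given bit width. The function names are hypothetical, and rejecting an empty collection is a choice of this sketch rather than something the definitions state.

```python
from typing import Iterable


def is_code_set(values: Iterable[int], bits: int) -> bool:
    """True if `values` are consecutive bit combinations of `bits` bits."""
    ordered = sorted(set(values))
    if not ordered:
        return False  # choice made for this sketch
    in_range = ordered[0] >= 0 and ordered[-1] < 2 ** bits
    consecutive = ordered[-1] - ordered[0] + 1 == len(ordered)
    return in_range and consecutive


def code_set_position(bit_combination: str) -> int:
    """The position corresponds to the numeric value of the bit combination."""
    return int(bit_combination, 2)


print(is_code_set(range(256), 8))       # the full 8-bit code
print(is_code_set(range(32, 127), 8))   # a contiguous subset
print(is_code_set([0, 1, 3], 8))        # a gap, so not a code set
print(code_set_position("11111111"))
```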

4.45 coded representation: The representation of a character as a single bit combination in a code set.

NOTE--A coded representation is always a single bit combination, even though the bit combination may be several 8-bit bytes in size.

4.51 conforming SGML document: An SGML document that complies with all provisions of this International Standard.

NOTE--The provisions allow for choices in the use of optional features and variant concrete syntaxes.

4.61 contextually required element: An element that is not a contextually optional element and

1. whose generic identifier is the document type name; or

2. whose currently applicable content token is a contextually required token.

NOTE--An element could be neither contextually required nor contextually optional; for example, an element whose currently applicable content token is in an or group that has no inherently optional tokens.

4.71 current rank: The rank suffix that, when appended to a rank stem in a tag, will derive the element's generic identifier. For a start-tag it is the rank suffix of the most recent element with the identical rank stem, or a rank stem in the same ranked group. For an end-tag it is the rank suffix of the most recent open element with the identical rank stem.

4.75.1 data entity: An entity that was declared to be data and therefore is not parsed when referenced.

NOTES--

1. There are three kinds: character data entity, specific character data entity, and non-SGML data entity.

2. The interpretation of a data entity may be governed by a data content notation, which may be defined by another International Standard.

4.77 data tag group: A content token that associates a data tag pattern with a target element type.

NOTE--Within an instance of a target element, the data content and that of any subelements is scanned for a string that conforms to the pattern (a "data tag").

4.92 descriptive markup: Markup that describes the structure and other attributes of a document in a non-system-specific manner, independently of any processing that may be performed on it. In particular, SGML descriptive markup uses tags to express the element structure.

4.106 document type specification: A portion of a tag that identifies the document instances within which the tag will be processed.

NOTE--A name group performs the same function in an entity reference.

4.112 element set: A set of element, attribute definition list, and notation declarations that are used together.

NOTE--An element set can be public text.

4.117 empty link set: A link set that contains no link rules.

4.120 entity: A collection of characters that can be referenced as a unit.

NOTES--

1. Objects such as book chapters written by different authors, pi characters, or photographs, are often best managed by maintaining them as individual entities.

2. The actual storage of entities is system-specific, and could take the form of files, members of a partitioned data set, components of a data structure, or entries in a symbol table.

4.133 explicit link (process definition): A link process definition in which the result element types and their attributes and link attribute values can be specified for multiple source element types.

4.134 external entity: An entity whose replacement text is not incorporated in an entity declaration; its system identifier and/or public identifier is specified instead.

4.142 general delimiter (role): A delimiter role other than short reference.

4.147 graphic character: A character that is not a control character.

NOTE--For example, a letter, digit, or punctuation. It normally has a visual representation that is displayed when a document is presented.

4.149 group: The portion of a parameter that is bounded by a balanced pair of grpo and grpc delimiters or dtgo and dtgc delimiters.

NOTE--There are five kinds: name group, name token group, model group, data tag group, and data tag template group. A name, name token, or data tag template group cannot contain a nested group, but a model group can contain a nested model group or data tag group, and a data tag group can contain a nested data tag template group.

4.155 implicit link (process definition): A link process definition in which the result element types and their attributes are all implied by the application, but link attribute values can be specified for multiple source element types.

4.160.1 internal entity: An entity whose replacement text is incorporated in an entity declaration.

4.164 keyword: A parameter that is a reserved name.

NOTE--In parameters where either a keyword or a name defined by an application could be specified, the keyword is always preceded by the reserved name indicator. An application is therefore able to define names without regard to whether those names are also used by the concrete syntax.

4.167.1 link rule: A member of a link set; that is, for an implicit link, a source element specification, and for an explicit link, an explicit link rule.

4.168 link set: A named set of rules, declared in a link set declaration, by which elements of the source document type are linked to elements of the result document type.

4.171 link type declaration subset: The entity sets, link attribute sets, and link set declarations, that occur within the declaration subset of a link type declaration.

NOTE--The external entity referenced from the link type declaration is considered part of the declaration subset.

4.186 (markup) declaration: Markup that controls how other markup of a document is to be interpreted.

NOTE--There are 13 kinds: SGML, entity, element, attribute definition list, notation, document type, link type, link set, link set use, marked section, short reference mapping, short reference use, and comment.

4.205 named entity reference: An entity reference consisting of a delimited name of a general entity or parameter entity (possibly qualified by a name group) that was declared by an entity declaration.

NOTE--A general entity reference can have an undeclared name if a default entity was declared.

4.208 non-SGML data entity: A data entity in which a non-SGML character could occur.

4.224 parameter: The portion of a markup declaration that is bounded by ps separators (whether required or optional). A parameter can contain other parameters.

4.237 proper subelement: A subelement that is permitted by its containing element's model.

NOTE--An included subelement is not a proper subelement.

4.250 rank stem: A name from which a generic identifier can be derived by appending a rank suffix.

4.267 reportable markup error: A failure of a document to conform to this International Standard when it is parsed with respect to the active document and link types, other than a semantic error (such as a generic identifier that does not identify an element type) or:

1. an ambiguous content model;

2. an exclusion that could change a token's required or optional status in a model;

3. exceeding a capacity limit;

4. an error in the SGML declaration;

5. an otherwise allowable omission of a tag that creates an ambiguity;

6. the occurrence of a non-SGML character; or

7. a formal public identifier error.

4.276 separator: A character string that separates markup components from one another.

NOTES--

1. There are four kinds: s, ds, ps, and ts.

2. A separator cannot occur in data.

4.277 separator characters: A character class composed of function characters other than RE, RS, and SPACE, that are allowed in separators and that will be replaced by SPACE in those contexts in which RE is replaced by SPACE.

4.285 SGML parser: A program (or portion of a program or a combination of programs) that recognizes markup in SGML documents.

NOTE--If an analogy were to be drawn to programming language processors, an SGML parser would be said to perform the functions of both a lexical analyzer and a parser with respect to SGML documents.

4.290 short reference (delimiter): Short reference string.

4.299 simple link (process definition): A link process definition in which the result element types and their attributes are all implied by the application, and link attribute values can be specified only for the source document element.

4.312 system declaration: A declaration, included in the documentation for a conforming SGML system, that specifies the features, capacity set, concrete syntaxes, and character set that the system supports, and any validation services that it can perform.

4.315 target element: An element whose generic identifier is specified in a data tag group.

4.319 token: The portion of a group, including a complete nested group (but not a connector), that is, or could be, bounded by ts separators.

Clause 7

AE 7.9.3

Clarify that the order of the tokens is significant and cannot be changed by a parser.

Clause 9

AE 9.2.1

Add note clarifying that character classes in productions 52 and 53 are defined in Figures 1 and 2.

Clause 10

AR 10.1.6

Clarify that system must determine storage location of entity or notation from the name and external identifier; it does not "generate" a modified system identifier.

AE 10.1.7

Add note clarifying that character classes in production 78 are defined in Figure 1.

Clause 13

AE 13

In Note 1, change "document markup features" to: markup minimization features

AE 13

In Note 1, change last parenthesized phrase to: (for example, if the document quantity set required larger values than were available in the system quantity set)

FE 13

(N759 12, N790 2)

Clarify relationship between document character set and syntax-reference character set. In particular, that concrete syntax is defined in terms of characters, not bit combinations. (Contributions invited: a short explanation for this clause; examples and discussion for a technical report.)

AE 13.1

In first sentence, change "as" to: that is,

AE 13.1.1.1

Replace last paragraph with:

The public identifier should be a formal public identifier with a public text class of "CHARSET".

AE 13.1.2

In first paragraph, change "added" to: assigned

AE 13.4.1

Replace last paragraph with:

The public identifier should be a formal public identifier with a public text class of "SYNTAX".

AE 13.4.3

Change "as" to: that is

AE 13.4.3

Change "of" to: for

AR 13.4.5

(N927 1)

Resolve conflict between intent of text and syntax production rule, which restricts the declared concrete syntax, by treating production 189 as though each occurrence of "ps+, parameter literal" were replaced by "(ps+, parameter literal)+", and by replacing each occurrence of the word "literal" in the text with "literals".

AE 13.4.5

Change all occurrences of "added" to: assigned

AE 13.4.5

Clarify that a character can be assigned only once as a lower-case name or name-start character (that is, assigned once only to either LCNMCHAR or LCNMSTRT, but not both).
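A check for this rule could look like the following sketch. The function name `doubly_assigned` is hypothetical, and the two strings stand for the characters assigned to LCNMSTRT and LCNMCHAR in some declared concrete syntax; the sample values are illustrative only.

```python
from typing import List


def doubly_assigned(lcnmstrt: str, lcnmchar: str) -> List[str]:
    """Characters assigned more than once across LCNMSTRT and LCNMCHAR."""
    combined = lcnmstrt + lcnmchar
    return sorted({ch for ch in combined if combined.count(ch) > 1})


print(doubly_assigned("", "-."))    # no character is assigned twice
print(doubly_assigned("-", "-."))   # "-" appears in both lists
```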

AR 13.4.5

(N759 7, N790 5)

Clarify that different lower-case characters can be associated with the same upper-case form, which can be a UC Letter. The associated upper-case forms can be the same as the lower-case, for languages (or special characters) where the concept of capitalization does not apply.

FT 13.4.5

(N759 10, N790 1 4)

Allow the set of Digit characters to be extended by a concrete syntax (NUCHAR for "numeral characters"?). A character could not be assigned to more than one of NUCHAR, LCNMSTRT, and LCNMCHAR.

FT 13.4.5

(N927 1)

Devise a less burdensome method of declaring long sequences of character numbers.

AR 13.4.6

(N759 2)

Add new paragraph:

The length of a delimiter string in the delimiter set cannot exceed the NAMELEN quantity of the quantity set.
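The proposed constraint is simple to verify. The sketch below assumes the delimiter set is available as a mapping from delimiter role name to delimiter string; the function name `overlong_delimiters` and the sample mapping are illustrative assumptions, not a complete delimiter set.

```python
from typing import Dict, List


def overlong_delimiters(delimiters: Dict[str, str], namelen: int) -> List[str]:
    """Roles whose delimiter string is longer than the NAMELEN quantity."""
    return sorted(role for role, text in delimiters.items() if len(text) > namelen)


# A few delimiter roles with sample strings (illustrative subset only).
sample = {"STAGO": "<", "ETAGO": "</", "MDO": "<!", "PIO": "<?"}

print(overlong_delimiters(sample, 8))
print(overlong_delimiters({**sample, "MDO": "<!declare:"}, 8))
```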

AR 13.4.7

(N759 1)

In production 193, replace second name with parameter literal and replace first paragraph with:

The name is a reference reserved name that is replaced in the declared concrete syntax by the interpreted parameter literal, which must be a valid name in the declared concrete syntax.

AE 13.4.7

Add new note before the existing first note:

NOTE--The list of reference reserved names that can be replaced in a declared concrete syntax is:

ANY EMPTY IDREFS MS NUMBERS RCDATA SPACE

ATTLIST ENDTAG IGNORE NAME NUTOKEN RE STARTTAG

CDATA ENTITIES IMPLIED NAMES NUTOKENS REQUIRED SUBDOC

CONREF ENTITY INCLUDE NDATA O RESTORE SYSTEM

CURRENT FIXED INITIAL NMTOKEN PCDATA RS TEMP

DEFAULT ID LINK NMTOKENS PI SDATA USELINK

DOCTYPE IDLINK LINKTYPE NOTATION POSTLINK SHORTREF USEMAP

ELEMENT IDREF MD NUMBER PUBLIC SIMPLE

AR 13.4.8

(N759 3)

In last sentence of first paragraph, change the period to: , which must exceed the reference value. The resulting quantity set must be rational.

NOTE--For example, TAGLEN must be greater than LITLEN because literals occur in start-tags. Similarly, LITLEN must exceed NAMELEN because names occur in literals.
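The two conditions in this item, together with the relationships named in the note, can be checked as in the sketch below. It is deliberately partial: only the TAGLEN, LITLEN, and NAMELEN relationships from the note are tested, whereas a full "rational" check would cover other quantities too. The function name `check_quantity_set` is hypothetical, and the values in `REFERENCE` are illustrative and should be confirmed against ISO 8879.

```python
from typing import Dict, List


def check_quantity_set(declared: Dict[str, int], reference: Dict[str, int]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Each changed quantity must exceed its reference value.
    for name, value in declared.items():
        if value <= reference[name]:
            problems.append(
                f"{name}={value} does not exceed reference value {reference[name]}"
            )
    # Unchanged quantities keep their reference values.
    effective = {**reference, **declared}
    if effective["TAGLEN"] <= effective["LITLEN"]:
        problems.append("TAGLEN must be greater than LITLEN")
    if effective["LITLEN"] <= effective["NAMELEN"]:
        problems.append("LITLEN must exceed NAMELEN")
    return problems


# Illustrative values for three quantities (an assumption of this sketch).
REFERENCE = {"NAMELEN": 8, "LITLEN": 240, "TAGLEN": 960}

print(check_quantity_set({"NAMELEN": 32}, REFERENCE))
print(check_quantity_set({"NAMELEN": 300}, REFERENCE))
```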

Clause 15

FE 15

(N680B 38b-41)

Make editorial changes.

AR 15.6, third paragraph

In the second sentence, change "as" to: that is,

Annex A

FE A

(N680B 42)

Make editorial change.

FE A

Add some key examples from current annexes B and C.

Annex F

AR F

Add explicit statement of information exchanged between SGML parser and application, based on Attachment 1 (Element Structure Information Set).

FE F.1

(N680B 74-75)

Make editorial changes.

Annex G

AE G

Delete and move to new standard on "Conformance Testing" if project for it is approved.

FE G

(N680B 76-79)

Make editorial changes.