Entity Management

SGML Open Technical Resolution 9401:1995
(Amendment 1 to TR 9401)

Paul Grosso, Chief Technical Officer
SGML Open
1995 September 8

© 1994, 1995 SGML Open

Permission to reproduce parts or all of this information in any form is granted to SGML Open members provided that this information by itself is not sold for profit and that SGML Open is credited as the author of this information.

Abstract

Two different but related issues pertaining to entity management impede interoperability of SGML documents:

A. the use of multiple vendors' applications on a given system, each of which must map an entity's external identifier and/or name to a file name;

B. the interchange of a set of files among different systems.

While there are many important issues involved and a complete solution is a long term goal, the SGML Open membership agrees upon the enclosed simple set of conventions to address a useful subset of the complete problem. To address issue A, this resolution defines an entity catalog that maps an entity's external identifier and/or name to a file name. To address issue B, this resolution defines a simple interchange packaging scheme using an interchange catalog to associate a public identifier with each interchanged file.

Technical Resolution 9401:1994
Final Technical Resolution: 1994 August 9
Technical Resolution 9401:1995 (Amendment 1)
Committee draft 1: 1995 March 1
Committee draft 2: 1995 March 23
Final Draft Technical Resolution: 1995 April 19
Final Technical Resolution: 1995 September 8


Introduction

In order to use a variety of SGML tools in a variety of computer environments, there are two different but related problems to solve:

A. using multiple vendors' applications on a given system, each of which must map an external entity's public identifier and/or entity name to a system-dependent file name;

B. interchanging a set of files among different systems.

There are many important issues involved and a complete solution -- possibly including work within the standards community -- is a long term goal. However, the SGML Open membership agrees at this time upon a simple set of conventions that addresses a useful subset of the complete problem.

The short term solution for issue A defines an entity catalog that handles the simple cases of mapping an external entity's public identifier and/or entity name to a system-dependent file name. This solution allows for a probably system-dependent but application-independent catalog. Though it does not handle all issues that a complete entity manager addresses, it simplifies use of multiple products in a great majority of cases.

While there are various interchange strategies already defined -- including the SGML Document Interchange Format (SDIF) defined in ISO 9069 -- none are currently widely used or supported by enough readily accessible implementations. This resolution addresses issue B by defining a simple interchange packaging scheme using an interchange catalog to associate a public identifier with each interchanged file.

Issue A: a simple entity catalog format

To address the issue of multiple vendors' applications on a given system, this resolution defines a format for a probably system-dependent but application-independent entity catalog that maps external identifiers and/or entity names to file names. This catalog is used by an application's entity manager. This resolution does not dictate when an entity manager should access this catalog; for example, an application may attempt other mapping algorithms before or (if the catalog fails to produce a successful mapping) after accessing this catalog. The catalog has a standard format. Each application that uses it must provide the user with a mechanism for specifying how and when the catalog is to be accessed. The logical catalog may be composed of multiple catalog entry files. It is up to the application to determine the ordered list of catalog entry files to be used as the logical catalog.

Each entry in the catalog associates a "storage object identifier" (such as a file name) with information about the external entity that appears in the SGML document. ("Storage object identifier" is frequently abbreviated "s.o.i." below.) For example, the following are possible catalog entries that associate a public identifier with an s.o.i.:

PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN" "iso-lat1.gml"
PUBLIC "-//USA/AAP//DTD BK-1//EN" "aapbook.dtd"
PUBLIC "-//ACME DTD Writers//DTD General Report//EN" "report.dtd"

In addition to entries that associate public identifiers, a catalog entry can associate an entity name with an s.o.i.:

ENTITY "chips" "graphics\chips.tif"

Both types of entries can occur in a single catalog:

PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN" "iso-lat1.gml"
PUBLIC "-//ACME DTD Writers//DTD General Report//EN" "report.dtd"
ENTITY "graph1" "graphics\graph1.cgm"

The name field in an ENTITY type catalog entry gives the "entity name" as specified in the entity declaration of an entity whose entity text is specified by an external entity specification. Note that, if the entity name is a parameter entity name (as opposed to a general entity name), an initial percent sign (%) is part of the name. (The percent sign -- which is the reference concrete syntax replacement for the "PERO" character -- shall be used in the catalog regardless of the concrete syntax of the current document.) It should be noted that ENTITY type catalog entries will not match the reference to the external subset in a DOCTYPE or LINKTYPE declaration. The complete set of catalog entry types defined by this Resolution is: PUBLIC, ENTITY, DOCTYPE, LINKTYPE, SGMLDECL, DTDDECL, and DOCUMENT.

Furthermore, to provide for possible future extensions or other uses of this catalog, its format allows for "other information" -- indicated by a "keyword" other than one of those defined by this Resolution -- that is irrelevant to and ignored by this resolution.

The formal syntax for a catalog entry file is:

catalog = 
  (ps*, ((extended external identifier | other information), ps+)*)

other information = keyword, (ps+, argument)+

keyword = system character+

argument = 
  system character+ |
  (LIT, system character+, LIT) |
  (LITA, system character+, LITA)

extended external identifier = 
  (publicid keyword, ps+, public identifier, ps+, storage object identifier) |
  (name keyword, ps+, entity name spec, ps+, storage object identifier) |
  (noname keyword, ps+, storage object identifier)

publicid keyword = "PUBLIC" | "DTDDECL"

name keyword = "ENTITY" | "DOCTYPE" | "LINKTYPE"

noname keyword = "SGMLDECL" | "DOCUMENT"

entity name spec =
  extended name character+ |
  (LIT, extended name character+, LIT) |
  (LITA, extended name character+, LITA)

storage object identifier =
  system character+ |
  (LIT, system character+, LIT) |
  (LITA, system character+, LITA)

ps = s | comment

comment = COM, system character*, COM

LIT = '"'   -- the double quote --

LITA = "'"   -- the single quote --

COM = "--"

where

Additional requirements:

This resolution only requires applications to handle storage object identifiers that specify file names. (Whether the s.o.i. can contain, for example, environment variables or special characters that are expected to be expanded further before resolving to a file name is not prescribed by this Resolution.) Applications may in addition recognize other types of storage object identifiers, as long as a storage object identifier that does not include characters other than letters, digits, hyphen, period, and underscore continues to be treated as a file name. Therefore, to avoid possible interpretation as something other than a file name, it is recommended (but not required) that file names be restricted to the characters just mentioned.
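
As an illustration only, the formal syntax above can be approximated by a small tokenizer and parser. The sketch below is in Python; the function names (tokenize, parse_catalog) and the tuple layout are hypothetical and not part of this resolution. It assumes keywords are written exactly as shown in the grammar (uppercase, unquoted), and it treats any token that is not a recognized keyword as "other information" to be skipped.

```python
import re

# Keywords followed by two fields (identifier or name, then s.o.i.).
TWO_FIELD = {"PUBLIC", "DTDDECL", "ENTITY", "DOCTYPE", "LINKTYPE"}
# Keywords followed by a single s.o.i. field.
ONE_FIELD = {"SGMLDECL", "DOCUMENT"}

_TOKEN = re.compile(
    r"""--.*?--          # comment: COM ... COM
      | "([^"]*)"        # group 1: literal delimited by LIT
      | '([^']*)'        # group 2: literal delimited by LITA
      | ([^\s"']\S*)     # group 3: unquoted token
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(text):
    """Yield (value, quoted) pairs; comments and white space are dropped."""
    for match in _TOKEN.finditer(text):
        if match.lastindex is None:  # no group matched, so this is a comment
            continue
        yield match.group(match.lastindex), match.lastindex != 3


def parse_catalog(text):
    """Return a list of (keyword, name_or_public_id, soi) tuples."""
    tokens = list(tokenize(text))
    entries = []
    i = 0
    while i < len(tokens):
        value, quoted = tokens[i]
        if not quoted and value in TWO_FIELD and i + 2 < len(tokens):
            entries.append((value, tokens[i + 1][0], tokens[i + 2][0]))
            i += 3
        elif not quoted and value in ONE_FIELD and i + 1 < len(tokens):
            entries.append((value, None, tokens[i + 1][0]))
            i += 2
        else:
            i += 1  # "other information": ignored by this sketch
    return entries
```

One limitation of this simplification: an unquoted argument of an unrecognized keyword that happens to equal a defined keyword would be read as the start of a new entry.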

An entry in the catalog is interpreted as follows:

The declaration of every external entity includes an entity name. (For the purposes of this discussion and the table below, we consider the term "entity name" to encompass also the doctype name from the document type declaration and the link type name from the link type declaration.) It may, in addition, associate a public identifier and/or a system identifier with the external entity. (See the table below for a summary of the various possibilities.) Note that the catalog maps public identifiers and entity names into storage object identifiers; the catalog does not map system identifiers (which usually represent valid storage object identifiers directly).

When doing a catalog lookup, an entity manager searches each catalog entry file for the external entity's public identifier. (This is true for any public identifier regardless of the type of entity.) If there are no successful matches, the entity manager searches for the entity name in the same catalog entry file attempting to match against the appropriate type of catalog entry given the entity type. The entity manager searches the next catalog entry file (if any) only after attempting all applicable searches for a given external entity. At the first match of external identifier or name, the entity manager stops the catalog lookup process and uses the associated s.o.i. to locate the entity.
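
The search order just described can be sketched as follows. This is a minimal Python illustration, not a prescribed implementation: the function name lookup and the representation of each catalog entry file as a list of (keyword, name or public identifier, s.o.i.) tuples are assumptions made for the example, and identifiers are compared as exact strings.

```python
def lookup(catalog_files, public_id=None, entity_name=None, name_keyword="ENTITY"):
    """Return the s.o.i. of the first matching entry, or None.

    catalog_files is the ordered list of catalog entry files, each given as
    a list of (keyword, name_or_public_id, soi) tuples. name_keyword selects
    the entry type appropriate to the entity: "ENTITY", "DOCTYPE", or
    "LINKTYPE".
    """
    for entries in catalog_files:
        # First search this file for the public identifier, if there is one.
        if public_id is not None:
            for keyword, name, soi in entries:
                if keyword == "PUBLIC" and name == public_id:
                    return soi
        # Then search the same file for the entity name.
        if entity_name is not None:
            for keyword, name, soi in entries:
                if keyword == name_keyword and name == entity_name:
                    return soi
        # Only now does the search move on to the next catalog entry file.
    return None
```

Note that an entity-name match in an earlier catalog entry file takes precedence over a public-identifier match in a later one, because all applicable searches are attempted in one file before the next file is consulted.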

Generally, when a system identifier is specified in an external entity declaration, it can be trusted to be a valid s.o.i. However, in some circumstances (such as when the document was generated on another system, when the document was generated in another location on the same system, or when some files referenced by system identifiers have had their locations changed since the document was generated), the specified system identifiers may not be valid. In this latter case, preferring the public identifier or entity name over the system identifier may be a more dependable way of accessing the entity. Therefore, this resolution defines two modes for using the above search strategy when an external identifier has an explicit system identifier:

In the first mode, the system identifier is preferred (trusted): the entity manager uses the specified system identifier directly as the s.o.i. In the second mode, the system identifier is not preferred: the entity manager first attempts the catalog lookup described above (public identifier, then entity name) and uses the specified system identifier only if that lookup produces no match.

An application must provide some way (e.g., a runtime argument, environment variable, preference switch) that allows the user to specify which of these modes to use.

The accompanying table summarizes the various possibilities.

External identifier includes:    Prefer/trust      Identifiers to try (in order)
SYSTEM id       PUBLIC id        SYSTEM id?        to resolve into the s.o.i.
---------       ---------        --------------    -----------------------------
no              no               Not applicable    ENTITY
no              yes              Not applicable    PUBLIC, ENTITY
yes             no               no                ENTITY, SYSTEM
yes             no               yes               SYSTEM
yes             yes              no                PUBLIC, ENTITY, SYSTEM
yes             yes              yes               SYSTEM
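
The table can also be expressed as a small function. The following Python sketch is illustrative only; the function name identifiers_to_try and its parameters are hypothetical. It returns the identifier types in the order the table lists them.

```python
def identifiers_to_try(has_system_id, has_public_id, prefer_system=False):
    """Return the ordered list of identifier types used to resolve the s.o.i.

    prefer_system is the user-selected mode; it matters only when the
    external identifier includes an explicit system identifier.
    """
    if has_system_id and prefer_system:
        return ["SYSTEM"]
    order = []
    if has_public_id:
        order.append("PUBLIC")
    order.append("ENTITY")  # the entity name is always available
    if has_system_id:
        order.append("SYSTEM")
    return order
```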

The use of hyphens or colons in the ISO owner identifier

Since this resolution pertains to public identifiers, it addresses one additional detail about public identifiers. ISO 8879 is inconsistent about the use of hyphens and colons in ISO owner identifiers. Clause 10.2.1.1 of 8879:1986 (unamended) has a note indicating that the ISO owner identifier for the SGML standard is "ISO 8879-1986". Production [171] of clause 13 indicates that the minimum literal in the SGML declaration must be "ISO 8879-1986". While Amendment 1 of 8879 does not alter clause 10.2.1.1, it does alter production [171] of clause 13 to say that the minimum literal in the SGML declaration should be "ISO 8879:1986". This has led to the propagation of both the dash and the colon in ISO owner identifiers. In the interests of interoperability, this SGML Open resolution requires that all products accept either form as a valid ISO owner identifier. Note, however, that this should not be construed to mean that a public identifier using one form should necessarily cause a catalog lookup match to succeed with a public identifier using the other form; while this resolution requires SGML systems to accept either form as valid, in practice, two entries (differing only by the single ":" or "-" character) may be needed in the catalog if both forms should refer to the same storage object identifier.
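
Where both forms should map to the same s.o.i., a catalog maintainer might generate the paired entries mechanically. The Python sketch below is one hypothetical way to do so; the function name iso_owner_variants and the pattern used to recognize an ISO owner identifier (the word ISO, a number, a hyphen or colon, and a four-digit year, followed by "//") are assumptions of this example rather than definitions from ISO 8879 or this resolution.

```python
import re

# Assumed shape of an ISO owner identifier at the start of a public identifier.
_ISO_OWNER = re.compile(r"^(ISO \d+)([-:])(\d{4})(?=//)")


def iso_owner_variants(public_id):
    """Return the hyphen and colon forms of a public identifier.

    Public identifiers that do not start with an ISO owner identifier of
    the assumed shape are returned unchanged as a one-element list.
    """
    match = _ISO_OWNER.match(public_id)
    if match is None:
        return [public_id]
    rest = public_id[match.end():]
    return [f"{match.group(1)}{sep}{match.group(3)}{rest}" for sep in "-:"]
```

Each returned string would then be written as its own PUBLIC entry with the same s.o.i.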

Referencing the implied SGML declaration

The SGML standard allows for an SGML declaration to be included explicitly in a document or to be implied by the processing system. This Resolution defines two ways to specify the implied SGML declaration: the SGMLDECL catalog entry type and the DTDDECL catalog entry type. Note that, in the DTDDECL method, the implied SGML declaration depends on information in the remainder of the document. Since the SGML declaration must be processed before a parser can interpret the prolog and document instance set, an implementation may choose to determine the SGML declaration with a preprocessor that scans the document for the relevant information. In any case, once it has been determined whether an explicit SGML declaration is present and, if not, how to locate the implied SGML declaration, parsing begins at the start of the document.

In many situations, the appropriate SGML declaration can be inferred from the "DTD" in use. This is especially common in the case that the external subset referenced in the doctype declaration is a publicly distributed entity. Therefore, this Resolution adds the capability to associate an SGML declaration with a "DTD" referenced by a PUBLIC identifier. In particular, if there is no explicit SGML declaration and the doctype declaration uses a PUBLIC identifier to reference the external subset (commonly known as "the DTD"), then the catalog will be searched for a DTDDECL entry whose public identifier field matches the public identifier of the external subset, and the associated s.o.i. will be used to locate the default SGML declaration to be used.

If there is no explicit SGML declaration and no DTDDECL entry was applicable, then the catalog will be searched for the first SGMLDECL entry, and its s.o.i. will be used to locate the default SGML declaration to be used. The use of an SGMLDECL catalog entry, in fact, is the preferred method of indicating the SGML declaration when an SGML declaration is part of a transfer package but is not transmitted as the initial part of the document entity itself.
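
The order of precedence described in this section can be sketched as follows. This Python fragment is an illustration under stated assumptions: the function name find_sgml_declaration is hypothetical, the catalog is represented as a list of (keyword, name or public identifier, s.o.i.) tuples, and the caller is assumed to have already determined (for example, with a preprocessor) whether an explicit SGML declaration is present and what public identifier, if any, the doctype declaration uses for the external subset.

```python
def find_sgml_declaration(entries, has_explicit_decl, doctype_public_id=None):
    """Return the s.o.i. of the implied SGML declaration, or None.

    None is returned when the document carries an explicit SGML declaration
    or when the catalog offers no applicable entry.
    """
    if has_explicit_decl:
        return None
    # A DTDDECL entry matching the external subset's public identifier wins.
    if doctype_public_id is not None:
        for keyword, name, soi in entries:
            if keyword == "DTDDECL" and name == doctype_public_id:
                return soi
    # Otherwise fall back to the first SGMLDECL entry.
    for keyword, _name, soi in entries:
        if keyword == "SGMLDECL":
            return soi
    return None
```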

Issue B: an interchange packaging scheme

The issue of interchanging a set of files among different systems can be partially addressed by an interchange packaging scheme that includes an interchange catalog that associates external identifiers with the various files in the interchange package. This resolution, which assumes the catalog format defined above, describes such a scheme.

This resolution does not support the use of explicitly specified system identifiers; that is, an external entity's declaration may specify a public identifier or it may use the SYSTEM keyword with no system identifier (in which case the entity's name will be used to do a catalog lookup for a matching catalog entry indicated by the ENTITY keyword). This resolution assumes a transmission medium that allows for the interchange of names for the various files in the interchange package and that the naming scheme allows that the name CATALOG be a valid name (in either all uppercase or all lowercase).

The actual transmission medium and details of writing and reading the interchange package are irrelevant. This resolution assumes that there exists a single location (e.g., directory) on the receiving system that already contains the set of interchanged files. (The generation of such an interchange package by the sending system is not explicitly discussed, but it is assumed that this discussion about receiving and interpreting an interchange package will make clear what is necessary to do on the sending system.) In this resolution, the phrase "interchange package" refers to this set of files in this location and "interchange directory" refers to this location.

An interchange package must have exactly one file named either CATALOG or catalog (more precisely, one and only one such file at the top level of the interchange directory); this is the catalog pointing into the interchange package itself. This catalog entry file must have a mapping for all files in the interchange package. That is, for each file in the interchange package (other than this catalog file), there must be a catalog entry whose s.o.i. identifies the file.
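
A receiving system could check these two requirements along the following lines. The Python sketch below is illustrative only; the function name check_package is hypothetical, the catalog is assumed to have been parsed already into (keyword, name or public identifier, s.o.i.) tuples, and each s.o.i. is assumed to be a file name relative to the interchange directory that is valid on the receiving system.

```python
from pathlib import Path


def check_package(directory, entries):
    """Return a list of problems found in an interchange package."""
    root = Path(directory)
    problems = []

    # Exactly one top-level file named CATALOG or catalog.
    catalogs = [
        p for p in root.iterdir()
        if p.is_file() and p.name in ("CATALOG", "catalog")
    ]
    if len(catalogs) != 1:
        problems.append(f"expected exactly one catalog file, found {len(catalogs)}")

    # Every other file must be identified by the s.o.i. of some entry.
    referenced = {(root / soi).resolve() for _kw, _name, soi in entries}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path in catalogs:
            continue
        if path.resolve() not in referenced:
            problems.append(f"no catalog entry for {path.relative_to(root)}")
    return problems
```

An empty result means the package satisfies both requirements as this sketch interprets them.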

Ordinarily, the catalog should include a single entry of the DOCUMENT type whose s.o.i. identifies the file in the interchange package that is the document entity in which parsing begins, if any such entity exists in this interchange package. (Some interchange packages may not include such an entity, for example, if the interchanged files are a set of entity declaration files.) Although it does not prohibit such interchange, this resolution does not make explicit allowance for including multiple documents in a single interchange. To ensure maximum portability, each interchange package should consist of at most one document. (Since this resolution does not address details of actual transmissions, it does not prohibit multiple interchange packages within a single transmission.)

Provided that the interchange package's catalog has an unambiguous entry for each file named in the interchange package, an interchange package is valid even if the receiver must modify the s.o.i.'s in his/her copy of the catalog so that they are valid on the receiving system. However, when the sending and receiving systems have compatible naming schemes, files in the destination location may be given the same names as they had on the sending system. This possibility is more likely because relative paths in s.o.i.'s are relative to the catalog file and therefore relative to the top level of the interchange directory. If the receiving system is unknown or incompatible with the sending system, the sender may wish to construct an interchange package with names that are most likely to be valid on the widest variety of systems. (For example, an interchange package with file names of no more than eight alphanumeric characters -- and therefore no directory hierarchy -- should be maximally portable. However, this resolution does not impose any such restrictions since, in practice, it will often be known what the receiving system can handle, and it will be preferable to take advantage of its capabilities.)