Notes on sgmls handling of search for entities

C. M. Sperberg-McQueen

September 1993, rev. January 1994

The resolution of public entity names (i.e. how the entity manager finds them in the file system or elsewhere) is implementation dependent. These notes, originally intended solely for my own use, describe how the public-domain parser sgmls actually implements the mapping. It should be pointed out that the approach used by sgmls does not provide a completely arbitrary mapping between entity name or formal public identifier and the file identifier; this is not a problem for most users.

Let us start with the basic question: "How does sgmls locate an external entity when a PUBLIC identifier is used?"

To that, the high-level answer is:

  1. First it checks for an environment variable named SGML_PATH, which you will therefore probably want to set beforehand to something useful. If it doesn't find one, it uses a hard-coded default path (check your system documentation).
  2. The SGML_PATH value is a list of one or more system identifier patterns, separated by colons or semicolons (OS-dependent; check your documentation). Each pattern can contain literal characters as well as special keywords of the form %X (the sgmls documentation calls these substitution fields). E.g. %S;%N.%C;/sgml/pub/%N.%C
  3. For each system ID pattern, in sequence, sgmls replaces the substitution fields with appropriate string values. It then looks for a file with that system identifier. If it finds one, it opens it. If it can't find one, it tries the next pattern. The meaning of the substitution fields is described in the file sgmls.doc, which is part of the standard distribution. If a substitution field is meaningless (e.g. because it refers to a different type of PUBLIC identifier from the one being processed), the pattern is passed over.
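
The three steps above can be sketched in a few lines of Python. This is only an illustration of the search logic as described in these notes, not the code sgmls actually uses: the function names (expand, candidates, locate) are invented for the example, the field values are supplied by the caller as a dictionary, and none of the version-specific case folding or character substitution is modelled.

```python
import os


def expand(pattern, fields):
    """Replace each %X field in pattern; return None if a field has no value."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            value = fields.get(pattern[i + 1])
            if value is None:
                return None  # meaningless field: the whole pattern is passed over
            out.append(value)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def candidates(path, fields, sep=":"):
    """Yield the system identifiers to try, in the order given by the path."""
    for pattern in path.split(sep):
        name = expand(pattern, fields)
        if name is not None:
            yield name


def locate(path, fields, exists=os.path.exists, sep=":"):
    """Return the first candidate the file system knows about, or None."""
    for name in candidates(path, fields, sep):
        if exists(name):
            return name
    return None


# Hypothetical field values for an entity declared with a public identifier
# but no system identifier, so %S has no value and that pattern is skipped.
fields = {"N": "isolat1", "C": "entities", "V": "local"}
print(list(candidates("%S:%N.%V:%N.%C", fields)))
# prints ['isolat1.local', 'isolat1.entities']
```

The separator is a parameter because, as noted above, it is a colon on some systems and a semicolon on others.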

The low-level answer follows. N.B. in version 1.1 some of the handling of blanks, underscores, etc. has changed, and may not be as described: check the documentation for the version you are using.

1 Simple example

Set an environment variable called SGML_PATH to a value. For the moment, let's assume you set it by saying something like this:

  set SGML_PATH %S:%N.%X:%N.%V:%N.%C

The default path (at least on Unix systems) is noted below.

The SGML_PATH environment variable governs the search SGMLS performs for public entities. Specifically, given declarations like these:

<!ENTITY % ISOLat1 PUBLIC
         "ISO 8879-1986//ENTITIES Added Latin 1//EN//Local" >
<!ENTITY % ISOLat2 PUBLIC
         "ISO 8879-1986//ENTITIES Added Latin 2//EN//Local" >
<!ENTITY % p2idmss PUBLIC
         "-//TEI//ENTITIES Marked sects for TEI P2 IDs//EN"
         'p2idlist.entities' >

SGMLS will ask the file system for a series of files, in an order determined by the SGML_PATH value. The first one found by the file system is the one used by SGMLS. For ISOLat1:

For ISOLat2, similarly, substituting ISOLat2 for ISOLat1. For the other one:

This is as far as I understand things at present; certainly sgmls has succeeded in picking up isolat1.local, isolat1.entities, isolat1.ppe, p2idmss.entities, and p2idlist.entities, under appropriate conditions.

2 Tree-structured directories, a more complex example

In a system with tree-structured directories, of course, more of the public identifier can be used to find stuff. The default path search for the ISOLat1 entity would be, as I understand it:


3 Overview of public identifiers and sgmls substitution fields

For the record, the overall structure of formal public identifiers is:

  pubid ::= owner '//' class ('-//')? desc '//' lang ('//' version)?
  owner ::= 'ISO' data | '+//' data | '-//' data
  class ::= CHARSET | ENTITIES | DTD | DOCUMENT | etc.
  desc  ::= data
  lang  ::= /* code from ISO 639 */
        |   'ESC' n/n n/n
  version ::= data
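
To make the grammar concrete, here is a minimal Python sketch that splits a formal public identifier into the parts named above. It is an illustration only, not how sgmls parses identifiers. The function name parse_fpi and the dictionary keys are invented for the example, and the sketch assumes (as in the declarations quoted in section 1) that the class is the first word of the text identifier, separated from the description by a space.

```python
def parse_fpi(pubid):
    """Split a formal public identifier into its components (no validation)."""
    parts = pubid.split("//")
    if parts[0] in ("+", "-"):
        # '+//' marks a registered owner, '-//' an unregistered one.
        kind = "registered" if parts[0] == "+" else "unregistered"
        owner, rest = parts[1], parts[2:]
    else:
        # Otherwise the owner is an ISO owner such as 'ISO 8879-1986'.
        kind, owner, rest = "ISO", parts[0], parts[1:]
    text_class, _, desc = rest[0].partition(" ")
    unavailable = desc == "-"  # optional '-//' before the description
    if unavailable:
        desc, rest = rest[1], rest[1:]
    return {
        "owner_kind": kind,
        "owner": owner,
        "class": text_class,
        "unavailable": unavailable,
        "desc": desc,
        # For most classes this slot is a language code; for CHARSET it
        # holds the escape sequence instead. The sketch does not distinguish.
        "lang": rest[1] if len(rest) > 1 else None,
        "version": rest[2] if len(rest) > 2 else None,
    }


print(parse_fpi("ISO 8879-1986//ENTITIES Added Latin 1//EN//Local"))
print(parse_fpi("-//TEI//ENTITIES Marked sects for TEI P2 IDs//EN"))
```

For the first identifier the version comes out as Local; for the second, which has no version part, it comes out as None.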

The components of the public identifier are picked up by different `substitution fields':

the entire public ID?
the owner name (minus the +// and -//) (%I, %R, and %U can be used to ensure that a search pattern only succeeds for ISO owners, registered owners, or unregistered owners; they expand to the empty string in the appropriate case, and to null (failure) otherwise)
the class
the description of the public entity
the language code (EN, FR, etc. from ISO 639)
the character set escape sequence
the version descriptor

Various types of case folding and character substitution or character deletion are performed, which should be described in the documentation for the version of sgmls you are using (they are set at compile time, and differ in the Unix, VMS, and DOS versions, to suit the operating systems).

Still other substitution fields pick up other parts of the entity declaration:

the entity's `data content notation'. I am not sure, but believe what is meant by this is the notation name given for an external entity declared as being in a specific notation. In the declaration the name tiff is the notation name.
the entity name (the name used in references to this entity)
the public identifier (the whole string, it appears)
the system identifier (i.e. in the string bar.doc)
a string chosen from a rather complex table, depending on whether sgmls is searching for a data entity, subdocument entity, general text entity, parameter entity, dtd, or lpd, and on whether it is declared without a public identifier, with a public identifier, with a device-dependent version string or without one
a string chosen from a simple table depending on whether sgmls is searching for a subdoc entity, a data entity, a text entity, a parameter entity, a dtd, or an lpd
causes the search to fail if the formal public identifier contains an `unavailable text' indicator