November 30, 1993
Copyright Charles F. Goldfarb 1993.
This paper may be copied for non-commercial distribution to developers and reviewers of SGML standards in support of the work of national, international, and industry SGML standards committees, provided that all copies are accurate and complete in every respect, including this notice.
NOTE: This paper summarizes the key aspects of entity management in SGML. It is not a tutorial on the SGML entity structure. Knowledge of entity constructs (entity declarations, ENTITY attributes, entity references, data attributes, external identifiers, etc.) is assumed. The author gratefully acknowledges the many helpful contributions of Erik Naggum, Eliot Kimber, and Wayne Wohler.
SGML was originally developed to meet the requirements of publishers -- that is, people who own information and need to process it in multiple formats and disseminate it in a variety of media. SGML was NOT developed to meet the requirements of the developers of the processing systems (like most "industry" standards) nor of the owners of the distribution media (like CCITT "recommendations").
At least two significant principles of SGML can be attributed to this heritage:
In typical SGML applications, the element structure represents the abstract (or "logical") information, and non-SGML means (e.g., PostScript) are used for the perceivable representations (although additional element structures can be used for this purpose as well, as in HyTime). The entity structure represents the storage.
The distinction between the abstract information and the storage is inherent in information processing and predates the computer. For example, whether a book is published as one large volume or several smaller ones has no bearing on the abstract information. Similarly, there are important properties of the book, such as its location and the access rights to it, that are most usefully considered to be part of the storage structure because changes in the abstract information will not directly affect them.
The problem with providing a storage representation for information is that it inherently conflicts with the need for system-independence, since physical storage of information is one of the primary services of a computer system. SGML resolves this conflict by interposing the entity structure as a virtual storage system between the element structure(s) and the real storage system(s); that is, the file system, data base, distributed object store, etc. The portion of an SGML system that maintains the relationship between entities and real storage is known as the "entity manager". It can be thought of as a service that provides access to storage for its clients -- the SGML parser and SGML application and utility programs.
The location of an entity in real storage is specified in a markup declaration parameter known as an "external identifier".
The occurrence of an external identifier always means that an entity is (explicitly or implicitly) being declared. There are ten categories of declaration (or declaration equivalent) in which external identifiers can occur. They are listed in the table below, along with the types of entity they can identify.
The types of entity are:
External identifiers occur in entity declarations, of course, but also in notation, document type, and link type declarations. An external identifier is also specified when parsing is invoked, to identify the undeclared SGML document entity (see 2.2, below). Several external identifiers can be associated with the SGML declaration (see 2.3, below).
Declaration Category   Entity Type     Special Use
---------------------  --------------  ------------------------------------
DOCTYPE                SGML entity     External subset
ENTITY                 SGML entity
                       data entity
                       program entity
(default ENTITY)       (like ENTITY, but not declared until reference occurs)
(undeclared ENTITY)    SGML entity     Document entity in which parsing begins
LINKTYPE               SGML entity     External subset
NOTATION               data entity     Specification of data content notation
                       program entity  Interpreter of data content notation
(implied SGML)         SGML entity     SGML declaration
SGML: BASESET          SGML entity     BASESET parameter
SGML: CAPACITY         SGML entity     CAPACITY parameter
SGML: SYNTAX           SGML entity     SYNTAX parameter
Parsing always begins in an SGML document entity for which there is necessarily no entity declaration. Instead, an external identifier (or information from which an external identifier can be derived) must be specified when processing is invoked.
The SGML declaration can be "implied" for documents processed by a "single system". The practical definition of "single system" in this case is "environments sharing the same implied SGML declaration". (In other words, you don't need the SGML declaration if you are certain that all recipients will imply the same SGML declaration that you did.)
As with the undeclared SGML document entity, there is no declaration for an implied SGML declaration entity. The external identifier occurs only in communication among the user, application, parser, and entity manager.
Note: An implied SGML declaration could be identified in a non-standardized way, such as by a processing instruction like:
<?SGML "ISO 8879:1986" PUBLIC "-//USA-DOD//SGMLDECL MIL-M-28001//EN" >
External identifiers also occur in the BASESET, CAPACITY, and SYNTAX parameters of the SGML declaration. The SGML parser treats them as both declaring and referencing an SGML entity.
Entity names are not created by the entity manager. However, there are services in which an entity name can be useful, such as generating system identifiers (see 3.5, below), which is why entity names are discussed here.
Every entity that is declared or referenced explicitly in an SGML document has a natural entity name that is unique within the name space in which it occurs (that is, within the document or subdocument). Entities that are declared implicitly by external identifiers do not have natural names. It may nevertheless be useful to assign names to them for purposes of communication among the user, application, parser, and entity manager (although not necessarily so: it depends on the parser and entity manager services and interfaces, error message formats, etc.).
The table below states how an entity name can be determined for each category of declaration. The sample name is preceded by an asterisk if it could conflict with an entity name occurring in the document. (In such cases, the created entity name must contain a character that is not allowed in a natural entity name, or the name must be used only in contexts in which the declaration category is used to distinguish it from natural entity names.) The sample name is in upper-case, indicating literal text, if generated by the parser. It is suffixed by "nn", representing a serial number, if the name could be generated for more than one entity.
Declaration Category   Sample Name     How Name is Determined
---------------------  --------------  ------------------------------------
DOCTYPE                *dtname         Document type name from declaration
ENTITY                 entname         Entity name from declaration
(default ENTITY)       entname         Entity name from reference
(undeclared ENTITY)    *SGMLDOC        Generated by parser
LINKTYPE               *ltname         Link type name from declaration
NOTATION               *notname        Notation name from declaration
(implied SGML)         *SGMLDECL       Generated by parser
SGML: BASESET          *BASESETnn      Generated by parser multiple times
SGML: CAPACITY         *CAPACITY       Generated by parser
SGML: SYNTAX           *SYNTAXnn       Generated by parser multiple times
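The naming scheme in the table can be sketched in a few lines of Python. This is an illustration only, not part of any standard interface: the class name EntityNamer, the two-digit rendering of "nn", and the choice to return a (category, name) pair are all assumptions. Returning the pair is one way to follow the second option described above, in which the declaration category distinguishes a created name from a natural entity name.

```python
from collections import defaultdict
from typing import Optional, Tuple

# Names the parser generates once per document (literal, upper-case).
FIXED_NAMES = {
    "(undeclared ENTITY)": "SGMLDOC",
    "(implied SGML)": "SGMLDECL",
    "SGML: CAPACITY": "CAPACITY",
}
# Names the parser may generate several times, so a serial number is added.
SERIAL_STEMS = {"SGML: BASESET": "BASESET", "SGML: SYNTAX": "SYNTAX"}


class EntityNamer:
    """Hypothetical helper that assigns a name for each declaration category."""

    def __init__(self) -> None:
        self._serial = defaultdict(int)

    def name_for(self, category: str, declared_name: Optional[str] = None) -> Tuple[str, str]:
        if category in FIXED_NAMES:
            name = FIXED_NAMES[category]
        elif category in SERIAL_STEMS:
            self._serial[category] += 1
            # Assumption: "nn" is rendered as a two-digit serial number.
            name = f"{SERIAL_STEMS[category]}{self._serial[category]:02d}"
        else:
            # DOCTYPE, ENTITY, LINKTYPE, NOTATION: the name comes from the markup.
            if declared_name is None:
                raise ValueError(f"{category} needs a name from the declaration or reference")
            name = declared_name
        # The category travels with the name so it cannot clash with a natural name.
        return (category, name)
```

Keying on the pair means that a document type named "book" and a general entity named "book" remain distinct without reserving a special character.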
An external identifier can be a "system identifier". Its purpose is to specify the location of the entity in real storage. (It is therefore egregiously misnamed: it is a "locator", not an "identifier".)
There is no general requirement in SGML that a single entity occupy a single real storage object (file, data base record, etc.). An entity can be distributed over several storage objects. A system identifier, therefore, is a list of "storage object specifications (SOS)", each of which must contain the storage system type and the identifier of the storage object within that system.
Typically, the storage system type will be the local file system and the storage object identifier will be the file identifier. (An implementation could choose to optimize its system identifier syntax for this case.) However, the storage object identifier could include additional parameters for compression, enciphering, or other purposes defined for the individual storage system.
For example, in storage systems such as archives (ZIP files) or data bases, the storage system type would identify a program, and the storage object identifier would consist of the ZIP file or data base name and parameters to identify the record, field, etc. (Although it would be possible to allow a full-blown query, to do so would be inappropriate. The purpose of a system identifier is to specify the location of known information. Queries are used to determine which information is wanted, and are better used at higher levels of the application.)
A storage system, if desired, could implement "identifier enhancement rules" for modifying a specified storage object identifier (for example, a DOS or UNIX system might prefix a filename with a drive and path taken from an environment variable).
The functions of an entity manager are such that portable code can be written for them, just as for an SGML parser. The system-dependent aspects are handled by "storage managers" (corresponding to storage system types) that are "enrolled" in a particular SGML system environment. The entity manager is responsible for cycling through the SOS list and calling the storage managers.
Storage managers are not restricted to persistent storage. For example, a word processor or text editor that created SGML documents in memory buffers could enroll its buffer manager as a storage manager.
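As a rough sketch of this division of labor, the Python below models a system identifier as a list of storage object specifications and lets the entity manager cycle through it, calling whichever storage manager is enrolled for each storage system type. The class names, the "memory" type name, and the modelling of a storage manager as a plain function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class StorageObjectSpec:
    """One SOS: a storage system type plus an identifier within that system."""
    storage_type: str   # hypothetical type names such as "file" or "memory"
    object_id: str


class EntityManager:
    """Portable part: walks the SOS list and delegates to storage managers."""

    def __init__(self) -> None:
        self._managers: Dict[str, Callable[[str], str]] = {}

    def enroll(self, storage_type: str, reader: Callable[[str], str]) -> None:
        # A storage manager is modelled as a function: object id -> text.
        self._managers[storage_type] = reader

    def read_entity(self, system_identifier: List[StorageObjectSpec]) -> str:
        parts = []
        for sos in system_identifier:
            if sos.storage_type not in self._managers:
                raise LookupError(f"no storage manager enrolled for {sos.storage_type!r}")
            parts.append(self._managers[sos.storage_type](sos.object_id))
        # The entity is the concatenation of its storage objects, in list order.
        return "".join(parts)


# An editor enrolls its in-memory buffer manager as a storage manager.
buffers = {"chapter1": "<chapter>One</chapter>", "chapter2": "<chapter>Two</chapter>"}
manager = EntityManager()
manager.enroll("memory", lambda object_id: buffers[object_id])

sysid = [StorageObjectSpec("memory", "chapter1"), StorageObjectSpec("memory", "chapter2")]
text = manager.read_entity(sysid)
```

Only the enrolled functions touch real storage, so the EntityManager class itself contains nothing system-dependent.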
For entities that are read by the SGML parser or clients, the entity manager performs "open", "read", and "close" functions by invoking corresponding functions in the storage managers named in the SOS list. The read function updates the current location, which is available to the client through an entity manager service. In order to support the rereading and reparsing required for multiple-character delimiters-in-context, the storage manager must also provide methods for "unreading" characters or retarding the current location ("set position").
Note: Erik Naggum has proposed an alternative to the conventional "read" and "unread" services which allows for more efficient implementation. Instead of reading text and updating the location, then unreading text and revising the location again, a parser would perform a trial read (a "peek") of a specified number of characters. A peek would not change the current location in the entity. A parser could therefore inexpensively repeek and reparse until the correct parse is determined. At that point it would use an additional service ("step") to advance the current location the correct number of characters.
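A minimal sketch of the "peek" and "step" services might look like the following. The class and method names are illustrative, and the delimiter recognition is a toy example rather than real SGML delimiter-in-context logic.

```python
class PeekReader:
    """Entity reader offering 'peek' and 'step' instead of 'read' and 'unread'."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0  # current location in the entity

    @property
    def location(self) -> int:
        return self._pos

    def peek(self, count: int) -> str:
        # Trial read: returns up to `count` characters; the location is unchanged.
        return self._text[self._pos:self._pos + count]

    def step(self, count: int) -> None:
        # Commit: advance the location once the correct parse is known.
        self._pos = min(self._pos + count, len(self._text))


def next_token(reader: PeekReader) -> str:
    """Toy recognizer: prefer the two-character delimiter '</' over a single character."""
    if reader.peek(2) == "</":
        token = "</"
    else:
        token = reader.peek(1)
    reader.step(len(token))
    return token
```

Because peek never moves the location, the parser can try the longer delimiter first and fall back to a shorter one without any "unread" bookkeeping.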
An entity need not occupy all of a storage object. It could, in fact, be distributed over several non-contiguous extents within that object. An SOS must include a specification of the substring(s) of the storage object that are occupied by the entity.
Note that the frequently-voiced criticism of SGML that "you can't have text and graphics in the same file" is actually a failing of existing entity manager implementations, not of the language. Although the text and graphics must be in separate entities, the entities could occupy different substrings of the same file.
Substring information is passed to the storage manager when a storage object is opened. The storage manager is responsible for returning only the specified substrings to the entity manager.
A standardized technique that can be applied to substring specification is HyTime dimension specifications (ISO/IEC 10744, clause 7.1). The "addressable range" is the number of storage units (usually bytes) in the storage object; only literal axis markers can be specified (that is, no dimension references, et al.); and only "error" or "trunc" can be used for overrun handling.
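The storage manager's side of substring handling can be sketched as follows. For simplicity this sketch uses zero-based (offset, length) pairs rather than HyTime axis markers, and its overrun parameter only loosely mirrors the "error" and "trunc" choices; the function name and the sample data are assumptions.

```python
from typing import List, Tuple


def extract_substrings(storage: bytes, extents: List[Tuple[int, int]], overrun: str = "error") -> bytes:
    """Return only the bytes of `storage` covered by `extents`, in list order.

    Each extent is a hypothetical zero-based (offset, length) pair.
    `overrun` is "error" (reject extents past the end) or "trunc" (clip them).
    """
    if overrun not in ("error", "trunc"):
        raise ValueError("overrun must be 'error' or 'trunc'")
    pieces = []
    for offset, length in extents:
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be non-negative")
        end = offset + length
        if end > len(storage):
            if overrun == "error":
                raise ValueError(f"extent ({offset}, {length}) overruns the storage object")
            end = len(storage)
        pieces.append(storage[offset:end])
    return b"".join(pieces)


# One storage object holding a text entity and a graphic entity side by side.
storage_object = b"<p>See figure.</p>GIF89a...."
text_entity = extract_substrings(storage_object, [(0, 18)])
graphic_entity = extract_substrings(storage_object, [(18, 100)], overrun="trunc")
```

The two calls show text and graphics living in separate entities while sharing a single file.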
Some text files preserve input line terminations. The SGML standard requires that, if separate input lines are to be presented to the parser, they must be presented as "records", bounded by SGML record-start and record-end characters.
An SOS for an SGML entity must specify how record boundaries are indicated so that the entity manager can remove line terminators and replace them with record boundaries. It should be possible to specify any of the common line terminator schemes (CR, LF, CRLF, LFCR) or a "record management system" (RMS) in which a line terminator is implied after each physical storage "read".
A full record boundary (RE RS) is inserted in place of each line terminator that is followed by other text in the entity (not necessarily in the same storage object). An RE alone is inserted in place of the trailing line terminator (if there is one).
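The insertion rule just stated can be sketched as below for an entity whose text has already been gathered into one string. The function name is hypothetical, the default RE and RS characters (13 and 10) follow the "RERS 13 10" style of parameter, and the "record management system" case is not covered.

```python
def insert_record_boundaries(text: str, terminator: str = "\n",
                             re_char: str = "\r", rs_char: str = "\n") -> str:
    """Replace line terminators with SGML record boundaries.

    Each terminator followed by more text becomes RE RS; a trailing
    terminator becomes RE alone. Defaults assume RE = 13 and RS = 10.
    """
    if not terminator:
        raise ValueError("terminator must not be empty")
    lines = text.split(terminator)
    trailing = text.endswith(terminator)
    if trailing:
        lines.pop()  # drop the empty piece that follows the final terminator
    body = (re_char + rs_char).join(lines)
    return body + re_char if trailing else body
```

For example, with LF line terminators the two-line text "a", "b" with a final newline becomes "a" RE RS "b" RE, and the same text without the final newline becomes "a" RE RS "b".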
SGML entities are frequently stored in storage systems that are not aware of SGML (as opposed to an SGML-aware interchange format or archive, such as those described below in 5.0). Such storage allows SGML entities to be processed by applications that are not SGML-aware, such as text editors and file utilities. However, such applications usually require use of the storage system's normal line terminator conventions. Therefore, if record boundary insertion has already been performed for an entity, its storage object specifications should so indicate, and should identify the characters used for the RE and RS so that conventional line terminators can be substituted if the entity is ever stored in non-SGML-aware storage. (For example, the record boundary insertion parameter could be "RERS 13 10".)
There is no record boundary insertion for data entities; "NONE" must be specified. However, "NONE" must not be specified for SGML entities that do not require record boundary insertion -- "RERS" should be used instead. The reason for this is so that "NONE" can unambiguously identify a data entity in cases where a parser, for example, could be reading either the SGML version of a DTD (SGML entity) or a pre-parsed version (data entity).
If "NONE" or "RERS" is specified for an SOS, it must be specified identically for all SOSes in the system identifier.
The system identifier for an SGML entity should also allow the user to insert record-starts and -ends and spaces between storage objects.
Record boundary insertion requires that the parser advise the entity manager which characters to use for RE and RS and SPACE for a given SGML entity. (This information is the only information about a client that the entity manager needs to know.)
SGML allows an external identifier to consist of an "omitted" system identifier. It is then up to the application or SGML system to generate one. The generated system identifier is treated like a specified one; in particular, it is subject to the application of identifier enhancement rules by the storage manager (see 3.1, above).
An omitted system identifier is frequently used in a default entity declaration.
Note: An empty system identifier ("") is not the same as an omitted one. Whether an empty system identifier is meaningful depends on the storage manager.
Note: One technique for handling an omitted system identifier is to assume it refers to a complete single storage object in the local file system and to allow the storage manager to generate the SOS from the entity name and declaration category (see 2.1, above).
When information is shared among several sites, it is normally inconvenient (or impossible) to reference it by means of its location (i.e., by its system identifier). That would be akin to a bibliographic citation referencing a book by its shelf position in a library. SGML therefore allows an external identifier to be a "public identifier" -- a character string that identifies an information object, but does not specify its location.
When an external identifier is only a public identifier, the entity manager must be able to map it to a system identifier (just as a library card catalog maps book titles to shelf locations). If the external identifier also includes a system identifier, then there is no need for the entity manager to consult a catalog, as the specified system identifier is the one to be used.
The format of a public identifier catalog is a list of external identifiers in which both the public identifier and system identifier are present.
A catalog could have associated "mapping rules" for generating a system identifier when no catalog entry exists for a particular public identifier.
Note: One technique for implementing mapping rules would be to create a template for the system identifier, with one or more placeholders for the entity name, declaration category (see 2.1, above), and/or the components of the formal public identifier -- owner identifier, public text class, public text description, public text display version, et al. There could either be a single template, or a separate one for each declaration category.
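A template-based mapping rule of this kind could be sketched as follows. The field splitting is deliberately simplified (it assumes an "owner//class description//language[//version]" layout and ignores the unavailable text indicator), and the function names, placeholder names, and the "file:/sgml/..." system identifier syntax are all assumptions for illustration.

```python
from typing import Dict


def split_fpi(public_id: str) -> Dict[str, str]:
    """Split a formal public identifier into named fields (simplified)."""
    parts = public_id.split("//")
    if parts[0] in ("-", "+"):   # unregistered or registered owner prefix
        parts = parts[1:]
    if len(parts) < 3:
        raise ValueError(f"not a formal public identifier: {public_id!r}")
    text_class, _, description = parts[1].partition(" ")
    return {
        "owner": parts[0],
        "text_class": text_class,
        "description": description,
        "language": parts[2],
        "version": parts[3] if len(parts) > 3 else "",
    }


def apply_mapping_rule(template: str, public_id: str, entity_name: str, category: str) -> str:
    """Fill a system identifier template from the public identifier fields."""
    fields = split_fpi(public_id)
    fields.update(entity_name=entity_name, category=category)
    return template.format(**fields)


# Hypothetical template; a real one would use the entity manager's own syntax.
TEMPLATE = "file:/sgml/{owner}/{text_class}/{description}.sgm"
generated = apply_mapping_rule(
    TEMPLATE, "-//USA-DOD//SGMLDECL MIL-M-28001//EN", "SGMLDECL", "(implied SGML)"
)
```

A separate template per declaration category would simply replace the single TEMPLATE constant with a lookup keyed on the category.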
There can be more than one version of an entity. For example, there can be definitional and multiple display versions of character entity sets, as described in the SGML standard. Other examples are readable specification and executable interpretation versions of a notation entity, and "pre-parsed" parser-specific versions of document type and link type declaration subsets.
An SGML "formal public identifier" has an optional field for specifying the desired version. The standard provides that the system will choose the best version if none is specified. An entity manager can meet this requirement by allowing its clients (SGML parser and application) to specify a version if there is none in the public identifier, and, if they choose not to do so, by having an entry in the mapping catalog for the versionless form of a public identifier (in addition to entries for specific versions).
Note: When parsing the prolog of an SGML document, the SGML parser does not normally interact with the application, so the application cannot specify versions dynamically. Instead, it can create a "version table" at the start of processing that specifies (or omits) a version for each category of declaration (see 2.1, above). The parser can consult this table to find the version to specify for each entity.
The SGML standard restricts the public text classes and versions of formal public identifiers when they occur in documents. However, these restrictions need not apply to public identifiers that only occur outside of documents; for example, public identifiers that are specified in command lines or dialog boxes, or that are generated by a parser or application. For such "unrestricted formal public identifiers":
Note: Unrestricted formal public identifiers cannot be used as formal public identifiers in conforming SGML documents.
An interchange facility allows one or more SGML documents and related information objects (such as style sheets and notation interpreters) to be combined into a single storage object in a way that preserves the entity structure, while providing freedom from the system-dependent aspects of the originating platform's storage systems.
In addition to several "industry-standard" archive file formats ("ZIP files"), there are two SGML-aware formally standardized interchange facilities:
The impact of these facilities on the entity manager is described below.
An SDIF packer is a utility program that combines one or more SGML documents into a single data structure in a way that preserves the entity structure. The data structure is composed of "descriptors", each consisting of an entity preceded by a header that gives its name, type and length. The user can tell the packer which entities to include or omit.
When a system receives an SDIF structure, it employs an "SDIF unpacker" utility to store the entities in appropriate storage objects. At that time, the unpacker changes the system identifiers in the received documents to reflect the new environment.
All entities in an SDIF structure are single objects, even if they were multiple objects in the originating system. However, the system identifiers, if present in the document, still contain the original SOS lists, which may be of interest to the SDIF unpacker when determining how to store the entities in the receiving environment.
SDIF is not organized for direct access and is not intended to be used as a storage system.
The SDIF packer need not perform record boundary insertion for SGML entities that are stored as a complete single storage object. An entity manager service indicates when an entity satisfies this condition. For SGML entities that do not, record boundary insertion must be performed when packing the objects into a single SDIF descriptor. The packer also changes the originating system's SOS to specify the RE and RS character numbers as the line terminators.
An SDIF packer also requires an SGML parser service that returns all external identifiers, so that the packer can locate the objects in which the entities are stored. The packer reads all the declared entities and writes them into the SDIF data structure. The external identifier parser service must also indicate when an external identifier identifies document entities and SGML subdocument entities, as the packer then calls on the external identifier parser service to locate other entities to pack. Depending on the user's requirements, the parser service will return either all external identifiers in the prolog, or only those found when the document is parsed with respect to a specified DTD or LPD.
Note that only entities declared in the SGML declaration or prolog (explicitly or implicitly) with external identifier parameters are packed. The document instance set is not considered. Entities declared only by a default entity declaration will not be packed.
SBENTO stores multiple entities within the storage of another entity, called a "container", which is itself stored contiguously within a single storage object. Within the container there are no "headers" to identify and separate the entities, as are found in an SDIF structure. Instead, a separate "table of contents" (TOC) entity describes and locates each entity in the container ("contained entity").
The contained entities can be distributed over several extents of the container (like substrings of a storage object), and two or more contained entities can have locations in common. For example, the color table of a graphic object can be an entity, as well as being part of a larger entity containing the entire graphic.
The TOC entity is an SGML document or subdocument entity and the TOC information occurs in the form of data attributes. The location of a contained entity is specified as an axis marker list, just as substrings of a storage object are, but the numbers refer to the entity (i.e., after record boundary insertion) and are therefore portable across systems. Because of the direct access enabled by the TOC, and the system-independent addressing, a SBENTO can act as a portable self-contained storage system for a document or hyperdocument. Unlike SDIF, it need not be "unpacked" after interchange, although it can be.
SBENTO requires special support in an SGML parser at the point of interface with the entity manager. The parser, which must be HyTime-aware, recognizes container and contained entities from data attributes on the entity declaration. For container entities, it notes that the entity is a SBENTO and stores its unitsize property (number of bits per addressable storage unit). For contained entities, it converts the entity-based axis markers into a substring list for the container storage object.
As SBENTO is an SGML-aware storage type, SGML entities in SBENTO contain RE/RS record boundaries exactly as they are to be seen by the SGML parser. No record boundary insertion is performed by the entity manager when accessing such entities.
The TOC and container entities of a SBENTO can be combined into a single data structure for interchange by using SDIF. However, the default interchange format is simply to concatenate the container to the TOC.
There are many archive file formats, such as the popular PKZIP file format found in DOS systems, that not only combine multiple storage objects, but provide the added benefits of compression and enciphering as well.
Many archive formats support a hierarchical structure, which can be used for the separate name spaces required by multiple SGML document and SGML subdocument entities. For this reason, such formats could be used for the "encoding" of SDIF data streams.
Storage managers can easily be created for archive file formats, thereby allowing entities to be kept in compressed and enciphered form until they are actually needed.
A simple "SGML archive" could be supported that would allow maximum portability of documents without modifying the system identifiers. The invocation of the parser would include an "archive name" parameter that serves as a nickname for the SGML document entity in which parsing begins. System identifiers for entities included in the archive would specify "SGML archive" as the storage system type, and the archive name as the storage object identifier, thereby indicating that they are stored in the same storage object as the SGML document entity.
Archive files typically need to be created by SGML-aware programs that can create the appropriate system identifiers.
When a program isn't executing, its representation is a data object that is susceptible of manipulation in the same way as other data objects. In particular, the executable versions of data content notation interpreters, methods for the rendition of multimedia objects, and other useful programs can be associated with SGML documents and included in interchange formats.
Because of the large variety of execution control methods and relationships in use by operating systems (including, for example, asynchronous processing such as calling a notation interpreter to operate on the content of an element that has not yet been parsed), the storage manager interface is not intended to support execution. Instead, the entity manager provides a method that returns the system identifier, and the client uses it as an argument list when calling the appropriate system service to execute the object.
A program entity is declared like any other data entity. It is referenced by specifying it in the value of a general entity attribute.
Interfaces between the entity manager, its clients, and the enrolled storage managers, occur in two forms: data interfaces and programming interfaces.
All critical information in SGML is expressible in data interfaces, as that is the only form of interface that can be independent of applications and systems. For entity management, the data interfaces are defined by International Standards, except for the format of the system identifier. An entity manager can use any syntax for this purpose that meets the requirements of 3.0, above.
The precise form of the programming interfaces between the entity manager, its clients, and the storage managers is not critical to achieving the objectives of SGML entity management. What is critical is the distribution of such tasks as identifier generation, error-handling, and fallback among those parties. The following scenarios are intended to be illustrative as to the techniques used for those tasks, but definitive as to who is to perform them.
Note: These scenarios should not be considered to be a complete list, nor are they intended as complete specifications of programming functions.
There are three groupings of scenarios:
1) Process external identifier
If the external identifier contains a system identifier, the procedure "process system identifier" is followed. If not, the procedure "process public identifier" is normally followed, or the procedure "test and process public identifier" can be followed.
2) Process system identifier
The system identifier is communicated to the entity manager, which constructs the entity "object" and returns a pointer to it. If the record boundary indicator for the storage objects is "NONE", the entity is marked as a data entity. If not, it is marked as an SGML entity.
If the specified system identifier is an omitted system identifier, the application can either generate one, or call on an entity manager service for that purpose (see 3.5, above). This service can be combined with the construction service.
Note: If the context in which this procedure is performed allows the client to know whether the entity is supposed to be a data entity, it could also perform the "check for data entity" procedure to determine whether the SOSes are correct.
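The "process system identifier" and "check for data entity" procedures might be sketched as below. The sketch also enforces the rule that "NONE" or "RERS", if specified for one SOS, must be specified for all of them. The class and function names, and the use of strings such as "LF" for line terminator schemes, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StorageObjectSpec:
    storage_type: str
    object_id: str
    record_boundary: str   # "NONE", "RERS", or a terminator scheme such as "LF"


@dataclass(frozen=True)
class Entity:
    sos_list: Tuple[StorageObjectSpec, ...]
    is_data_entity: bool


def process_system_identifier(sos_list: List[StorageObjectSpec]) -> Entity:
    """Construct the entity object and mark it as a data or SGML entity."""
    if not sos_list:
        raise ValueError("a system identifier needs at least one SOS")
    indicators = {sos.record_boundary for sos in sos_list}
    for fixed in ("NONE", "RERS"):
        # "NONE" and "RERS" must be specified identically for every SOS.
        if fixed in indicators and len(indicators) > 1:
            raise ValueError(f"{fixed} must be specified for all SOSes or none")
    return Entity(tuple(sos_list), is_data_entity=(indicators == {"NONE"}))


def check_for_data_entity(entity: Entity) -> bool:
    """TRUE for a data entity, FALSE for an SGML entity."""
    return entity.is_data_entity
```

A parser that can read either the SGML form of a DTD or a pre-parsed form could call check_for_data_entity on the constructed object to decide which reader to use.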
3) Process public identifier
The public identifier is communicated to the entity manager, which performs the "search public identifier catalog" procedure to obtain a system identifier with which it performs the "process system identifier" procedure.
4) Search public identifier catalog
The entity manager searches the public identifier catalog for the specified public identifier. If found, it returns the associated system identifier. If not found, and mapping rules are associated with the catalog, it returns a generated system identifier. If there are no mapping rules, it returns an error.
If the public identifier is a formal public identifier and its version field is empty, the parser can supply a version from the version table created earlier by the application. In this case, the entity manager will automatically search the catalog for the versionless form of the formal public identifier if the form with the parser-supplied version cannot be found.
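The catalog search, including the fallback to the versionless form, can be sketched as follows. The catalog is modelled as a simple mapping from public identifier to system identifier; the function name, the sample entries, and the convention of appending the version as a final "//" field are assumptions.

```python
from typing import Callable, Dict, Optional


def search_public_identifier_catalog(
    catalog: Dict[str, str],
    public_id: str,
    version: Optional[str] = None,
    mapping_rule: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a system identifier for `public_id`, or raise LookupError."""
    if version:
        # Try the form carrying the parser-supplied version first.
        versioned = f"{public_id}//{version}"
        if versioned in catalog:
            return catalog[versioned]
    # Fall back to the versionless form of the public identifier.
    if public_id in catalog:
        return catalog[public_id]
    # No entry: generate a system identifier if mapping rules are available.
    if mapping_rule is not None:
        return mapping_rule(public_id)
    raise LookupError(f"no catalog entry or mapping rule for {public_id!r}")


# Hypothetical catalog entries.
catalog = {
    "-//Example//DTD Report//EN": "file:report.dtd",
    "-//Example//DTD Report//EN//PREPARSED": "file:report.dtc",
}
```

Raising an error rather than returning a default leaves the fallback decision to the caller, which is where "test and process public identifier" places it.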
5) Test and process public identifier
The entity manager performs the procedure "search public identifier catalog", and, if a system identifier is returned, the procedures "process system identifier" and "check storage access". If the latter fails, the application is then free to exercise its own fallback procedures. Such procedures could include searching the catalog for other entries for the same public identifier.
6) Check storage access
If an object has not been constructed for the entity, an error is returned. Otherwise, the entity manager calls the storage manager(s) to make sure that all objects in the SOS list are accessible to the client. If not, it returns an error.
7) Check for data entity
The entity manager returns "TRUE" if the entity is a data entity; "FALSE" if it is an SGML entity.
8) Get system identifier
The entity manager returns the system identifier of the specified entity object.
1) Start of processing
The application extracts or derives from the command line or user dialogs:
The application invokes the SGML parser and communicates the above information to it.
2) Start of parsing
The SGML parser invokes the entity manager to perform the "process external identifier" procedure and read the SGML document entity.
3) Implied SGML declaration
If the parser encounters a DOCTYPE declaration before having encountered an SGML declaration, it invokes the "process external identifier" procedure for the implied SGML declaration entity. It then invokes the "check for data entity" procedure and reads the entity in either SGML or pre-parsed (data) form.
If the application did not communicate an external identifier for the implied SGML declaration entity, a defaulting mechanism is used to determine one.
Note: A simple defaulting mechanism for the implied SGML declaration is an entry in the public identifier catalog for the reserved public identifier "DEFAULT IMPLIED SGML DECLARATION".
The application (perhaps using information acquired from the user) establishes a version table for use by the parser.
4) SGML declaration BASESET, CAPACITY, or SYNTAX parameters
The SGML parser invokes the entity manager to perform the "process external identifier" procedure. It then invokes the "check for data entity" procedure and reads the entity in either SGML or pre-parsed (data) form.
5) DOCTYPE or LINKTYPE declaration
At the end of the internal subset, the SGML parser invokes the entity manager to perform the "process external identifier" procedure. It then invokes the "check for data entity" procedure and reads the entity in either SGML or pre-parsed (data) form.
6) ENTITY or NOTATION declaration
The SGML parser invokes the entity manager to perform the "process external identifier" procedure.
1) General entity name attribute
Depending on the type of entity and the semantics of the attribute, the application decides whether, for a:
2) Undeclared general entity name attribute
The parser invokes the "process external identifier" procedure using the external identifier from the #DEFAULT entity declaration. (Remaining processing is the same as for "general entity name attribute".)
3) Syntactic entity reference
If the entity is an SGML text entity, the SGML parser invokes the entity manager to read the entity. Otherwise, processing is the same as for "general entity name attribute".
4) Undeclared syntactic entity reference
The parser invokes the "process external identifier" procedure using the external identifier from the #DEFAULT entity declaration. (Remaining processing is the same as for "syntactic entity reference".)
5) NOTATION attribute
Processing is the same as for data entity and program entity in "general entity name attribute".
The entity structure of SGML provides a virtual storage system that insulates the element structure of a document from system dependencies, while still allowing storage-related properties to be specified. It is literally the foundation of SGML, as it allows the SGML-described application structure to exist within the normal storage facilities of computer systems, and SGML objects to be processed by non-SGML-aware programs. At the same time, the entity structure allows SGML-aware storage facilities to be used, but without requiring their exclusive use, and while preserving the ability to move objects between SGML-aware and non-SGML-aware storage.
An SGML system requires a properly-designed entity manager in order to give the user the full benefits of the entity structure. Such an entity manager must include support for multiple storage systems and line terminator conventions, and the ability to divide an entity over several storage objects, or, conversely, to store several entities in the same storage object.
An entity manager with the capabilities described in this paper could be implemented to be an independent service, just like an SGML parser. It would prove to be as valuable to SGML systems and applications as the parser itself.