[This local archive copy mirrored from the canonical site: http://tag.sgml.com/7030101.htm; links here may not have complete integrity, so use the canonical document at this URL if possible.]

<TAG>® The SGML Newsletter

Article from March, 1994

Formal Public Identifiers

by David C. Peterson

Most of us have run across "public identifiers" at one time or another. Most of us don't really know what all the parts of a "formal" public identifier are, and what those parts are for--especially the optional parts. That's what I hope to cover in this article.

Public identifiers are one of two kinds of external identifiers used by SGML systems. To put this article in context then, I'll start off by reviewing just what external identifiers are, and what they're for.

External identifiers are character strings that identify external entities (and notations and character sets, but this article won't cover them in any detail) to your SGML system's entity manager. A system identifier is a system-specific external identifier; with most SGML systems, every external entity must have one. However, SGML was designed for interchange between systems. Therefore, SGML provides for alternate external identifiers--public identifiers--that are, by definition, not system specific.

The rule is that in your DTD, the declaration of an external entity must specify a system identifier for that entity unless the system can identify the entity without it. Many entity managers must have either a system identifier or a public identifier, but most--when presented with an appropriate public identifier--can do without the system identifier. Often, the entity manager (a program or subroutine) will have an internal table, created by the system manager (a human), that maps public identifiers to system identifiers.
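The internal table described above can be pictured as a simple lookup. The Python sketch below is only an illustration: the names `CATALOG` and `resolve_entity`, the "Example Corp" identifier, and the file paths are all hypothetical, and the choice to consult the table before falling back to the system identifier is an assumption, not something SGML prescribes.

```python
# Hypothetical table: public identifier -> system identifier (here, a file path).
# The paths and the "Example Corp" entry are invented for illustration.
CATALOG = {
    "ISO 8879:1986//ENTITIES Added Latin 1//EN": "/usr/local/sgml/iso/isolat1.ent",
    "-//Example Corp//DTD Memo//EN": "/usr/local/sgml/example/memo.dtd",
}

def resolve_entity(public_id=None, system_id=None, catalog=CATALOG):
    """Return a system identifier for an external entity, or raise LookupError."""
    if public_id is not None:
        # Collapse runs of white space so the lookup key has a single form.
        key = " ".join(public_id.split())
        if key in catalog:
            return catalog[key]
    # Assumption: fall back to the declared system identifier, if there is one.
    if system_id is not None:
        return system_id
    raise LookupError("entity cannot be identified")
```

With this table, `resolve_entity(public_id="-//Example Corp//DTD Memo//EN")` returns the mapped path without any system identifier having been declared.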

System identifiers and public identifiers are both pretty much arbitrary character strings. Public identifiers are restricted to minimum data characters, which are upper and lower case, unaccented roman letters, the ten arabic digits, line breaks and space characters (no tabs!), and these special characters: '()+,-./:=?

(Note that this list doesn't include the double quote character!) An SGML system will also normalize any potential public identifier by stripping white space fore and aft, and condensing any interior white space to single non-adjacent space characters. So in your SGML declarations you can place a line-break in a public identifier at any space, and indent the remaining part for readability and still get the same result.
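The character restriction and the white-space normalization just described can be sketched in a few lines of Python. This is a minimal illustration rather than what any particular SGML system does; the names `ALLOWED` and `normalize_public_id` are hypothetical, and "line breaks" is assumed here to mean the carriage-return and line-feed characters.

```python
import string

# Minimum data characters as listed above: unaccented letters, digits,
# space and line-break characters, and the special characters '()+,-./:=?
SPECIALS = "'()+,-./:=?"
ALLOWED = set(string.ascii_letters + string.digits + SPECIALS + " \r\n")

def normalize_public_id(text):
    """Reject characters outside the list, then strip and collapse white space."""
    bad = sorted(set(text) - ALLOWED)
    if bad:
        raise ValueError(f"not minimum data characters: {bad!r}")
    # After the check, the only white space left is spaces and line breaks,
    # so split() with no arguments breaks the string on exactly those runs.
    return " ".join(text.split())
```

A public identifier broken across two lines and indented in a declaration comes out as a single line with single spaces; a tab or a double quote raises `ValueError`.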

SGML also provides for a special kind of public identifier, the formal public identifier. That's what this article is really about.

Okay. What Makes a Public Identifier "Formal"?

A formal public identifier has an internal structure. From it you should be able to tell:

  • who owns or created the entity,
  • what SGML use the entity is intended to have,
  • what language it is written in or for,
  • what system it is specialized for,
  • whether or not the entity is generally available, and
  • what the owner calls the entity.

Each of these items of information can be found in a substring of characters, a specific part of the whole string. With one exception, each part is separated from the next by a pair of slash characters (‘/’, aka solidus or virgule). Official SGML terminology groups the public identifier into two pieces, the owner identifier (which identifies the owner, of course) and the text identifier (which does everything else); there is a ‘//’ between them.

The Owner Identifier

There are three possibilities: either the owner has a "registered" name, or he (or she or it--the owner is often a corporation or government body) doesn't, or ISO (the International Organization for Standardization) is the "owner"--which generally means that the entity either is an ISO document or is defined in an ISO document.

An unregistered owner identifier is the simplest. It must start with ‘-//’, followed by the "name": any minimum data characters not including ‘//’. Each person or organization can select the name or names they want to use. Because it's not registered, there's no guarantee that no one else uses the same name.

A registered owner identifier is the same, except that it starts with ‘+//’. The difference is that there is a registration authority prescribing ways to ensure that different people and organizations use different names.

An ISO owner identifier always starts with ‘ISO’. What comes after is dealt with by ISO.

Okay. So far, we know that a formal public identifier with an unregistered owner looks like this: -//owner's "name"//text identifier.
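The three kinds of owner identifier can be told apart by their first few characters. The Python sketch below assumes the identifier has already been normalized; the function name `split_owner` and the sample identifiers in the usage note are hypothetical. For an ISO owner, the sketch simply keeps everything before the first ‘//’, since what follows ‘ISO’ is ISO's business.

```python
def split_owner(fpi):
    """Split a formal public identifier into (owner kind, owner name, text identifier)."""
    if fpi.startswith("+//"):
        kind, rest = "registered", fpi[3:]
    elif fpi.startswith("-//"):
        kind, rest = "unregistered", fpi[3:]
    elif fpi.startswith("ISO"):
        # The ISO owner identifier has no leading '+//' or '-//'.
        kind, rest = "ISO", fpi
    else:
        raise ValueError("owner identifier must start with '+//', '-//' or 'ISO'")
    # The first '//' after the owner's name ends the owner identifier.
    owner, sep, text_id = rest.partition("//")
    if not sep or not owner:
        raise ValueError("missing '//' between owner identifier and text identifier")
    return kind, owner, text_id
```

For example, `split_owner("-//Example Corp//DTD Memo//EN")` yields the kind `"unregistered"`, the name `"Example Corp"`, and the text identifier `"DTD Memo//EN"`.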

On to the text identifier.

The Text Identifier

The text identifier consists of five pieces, two of which are optional. They are the public text class, the unavailable text indicator, the public text description, the public text language (except for public identifiers of character sets), and the public text display version. It's the unavailable text indicator and the public text display version that are optional. I'll go through the parts in order--left to right in the text identifier.

The Public Text Class

The public text class is a single SGML name that tells what kind of data is in the entity--or what the public identifier is naming, if it names something other than an entity (it might be a notation or a character set, neither of which I will cover in this article). The name must be one of the ones in Table 1. There is a strange oversight in ISO 8879, however: There is no keyword provided for external SGML data entities! (And you aren't permitted to make up your own.) Presumably that will be fixed during the review of ISO 8879 that is going on right now. (Rest assured the review group is aware of the problem.)

The public text class is always followed by a space character to separate it from whatever is next. (Remember that no matter what white space you put in, it gets normalized to one space character.)

The Unavailable Text Indicator

The unavailable text indicator (‘-//’) is optional. If used, it goes right after the space character after the public text class and immediately before the public text description. It indicates that the owner of the entity has restricted the people and/or organizations to whom the entity can be provided. If it's omitted, then the owner (who presumably made up the public identifier) is asserting that the entity is somehow available to the general public, although there may be a fee payment or other conditions the general public must satisfy to get it.
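The start of a text identifier, then, is a class keyword, one space, and possibly the ‘-//’ indicator. A minimal Python sketch of that split follows. The names `PUBLIC_TEXT_CLASSES` and `split_class` are hypothetical, and the keyword set is the list as generally given for ISO 8879; check it against the standard itself before relying on it.

```python
# Public text class keywords as generally listed for ISO 8879 (verify against
# the standard before relying on this set).
PUBLIC_TEXT_CLASSES = {
    "CAPACITY", "CHARSET", "DOCUMENT", "DTD", "ELEMENTS", "ENTITIES", "LPD",
    "NONSGML", "NOTATION", "SHORTREF", "SUBDOC", "SYNTAX", "TEXT",
}

def split_class(text_id):
    """Return (public text class, unavailable flag, rest of the text identifier)."""
    # The class is separated from what follows by a single space, not by '//'.
    text_class, sep, rest = text_id.partition(" ")
    if not sep or text_class not in PUBLIC_TEXT_CLASSES:
        raise ValueError(f"unknown public text class: {text_class!r}")
    # The optional unavailable text indicator comes right after that space.
    unavailable = rest.startswith("-//")
    if unavailable:
        rest = rest[3:]
    return text_class, unavailable, rest
```

So `split_class("ENTITIES -//Private Symbols//EN")` reports the class `"ENTITIES"`, marks the text as unavailable, and leaves `"Private Symbols//EN"` for the description and language.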

The Public Text Description

This is whatever text the owner chooses to identify the entity and differentiate it from other entities that the owner might also own. It consists of any minimum data characters (described above) except for two adjacent slashes, and is completely arbitrary except that white space bands will always be replaced by a single space character. It is always followed by two slashes and then the public text language.

The Public Text Language

The public text language is a code for whatever language the entity is written in or with which it is intended to be used. It is normally the two-character code from ISO 639. You are probably used to seeing it as ‘EN’, the code for English.

The Public Text Display Version

Like the unavailable text indicator, this one is optional. If used, it goes after the public text language code, separated by another pair of slashes; it consists of any minimum data characters, and as usual, white space bands become single space characters. It can't have any white space at the end--it would get stripped.

An example of how the public text display version might be used is to differentiate between various versions of an entity set containing SDATA entity declarations for special characters--one version for each display system.

...And that's the whole thing! There's a sample formal public identifier broken into its parts in Table 2. When written out, there would be no white space between the parts. </>
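Putting the pieces together, a formal public identifier can be taken apart mechanically. The Python sketch below is one way to do it, under stated assumptions: the input is already normalized, the function name `parse_fpi` and the dictionary keys are hypothetical, the class keyword is not checked against the permitted list, and the field after the description is reported as the language even for character sets, where (as noted above) something else goes in that position.

```python
def parse_fpi(fpi):
    """Break a normalized formal public identifier into its named parts."""
    # Owner identifier: '+//name', '-//name', or something starting with 'ISO'.
    if fpi.startswith(("+//", "-//")):
        owner_kind = "registered" if fpi[0] == "+" else "unregistered"
        owner, sep, text_id = fpi[3:].partition("//")
    elif fpi.startswith("ISO"):
        owner_kind = "ISO"
        owner, sep, text_id = fpi.partition("//")
    else:
        raise ValueError("owner identifier must start with '+//', '-//' or 'ISO'")
    if not sep:
        raise ValueError("no '//' between owner identifier and text identifier")

    # Public text class, then one space, then the optional '-//' indicator.
    text_class, sep, rest = text_id.partition(" ")
    if not sep:
        raise ValueError("public text class must be followed by a space")
    unavailable = rest.startswith("-//")
    if unavailable:
        rest = rest[3:]

    # description//language, optionally followed by //display version.
    pieces = rest.split("//")
    if len(pieces) not in (2, 3):
        raise ValueError("expected description//language[//display version]")
    return {
        "owner_kind": owner_kind,
        "owner": owner,
        "text_class": text_class,
        "unavailable": unavailable,
        "description": pieces[0],
        "language": pieces[1],
        "display_version": pieces[2] if len(pieces) == 3 else None,
    }
```

Given the invented identifier `-//Example Corp//ENTITIES -//Special Symbols//EN//Display System X`, the function reports an unregistered owner "Example Corp", class `ENTITIES`, the unavailable flag set, description "Special Symbols", language `EN`, and display version "Display System X".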

Bio: David C. Peterson (davep@acm.org) is an SGML consultant for SGMLWorks! He is a Technical Expert for ISO/IEC JTC1/SC18/WG8, which oversees SGML.

Abstract: Dave Peterson discusses all the parts of "formal" public identifiers, including the optional parts. Public identifiers are one of two kinds of external identifiers used by SGML systems.


<TAG> is a registered trademark of SGML Associates, Inc.
All copy and information on this site is copyright © 1996 by SGML Associates, Inc.
Last modified 6-September, 1996.
Last modified: 01/06/1997 9:21 p.m.