[Archive copy mirrored from: http://www.ornl.gov/sgml/wg8/9573ent/ENTITIES.HTM, descriptive text only]

Sample collections of entities and glyphs, proposed for potential inclusion in ISO 9573, for Ugaritic, Old Persian, Glagolitic, Croatian, Buginese, Cherokee and Gothic Uncials. Developed by Anders Berglund (and others).

SGML Public Entity Sets, Proposals

What is an SGML Entity and a Public Entity Set?
What is the Repertoire in a Set
Guidelines for the different parts of an entity definition
Proposals

What is an SGML Entity and a Public Entity Set?

SGML, ISO/IEC 8879:1986, contains a mechanism for referring to characters, syllables and symbols that are not found on normal keyboards or that are difficult to store and transmit unambiguously. This is achieved by defining so-called (SDATA) entities: a name is given to a character, syllable or symbol, on the assumption that a system processing the SGML data will be able to understand the reference, either by its name or by the so-called replacement text. To refer to an entity in an SGML file, the name is prefixed by "&" and followed by ";". For example, &alpha; refers to the Greek alpha. ISO has published a number of collections of entities, the Public Entity Sets, and work is in progress to add a large number of entity sets for non-Latin languages.
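
The reference mechanism described above can be illustrated with a minimal sketch. It is not how any particular SGML system works. The entity table, the function name expand_references and the simplified name pattern are assumptions made for the example, and the replacement text "[alpha]" is illustrative rather than taken from a published set.

```python
import re

# Hypothetical table mapping entity names to replacement text.
# The entry is illustrative only and is not copied from a published set.
ENTITIES = {
    "alpha": "[alpha]",
}

# An entity reference is "&", then a name, then ";". The name pattern is a
# simplification: a letter followed by letters, digits, "." or "-".
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def expand_references(text, entities):
    """Replace each known &name; reference with its replacement text."""
    def substitute(match):
        name = match.group(1)
        # Unknown references are left untouched so that they stay visible.
        return entities.get(name, match.group(0))
    return REFERENCE.sub(substitute, text)


print(expand_references("The letter &alpha; and &unknown;.", ENTITIES))
# prints: The letter [alpha] and &unknown;.
```

A real system would go one step further and map the name or the replacement text to an actual glyph or character code; the sketch stops at the substitution step.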

For the purposes of reviewing and commenting on the sets, the name and the comment are the only relevant parts. The published entity sets also refer to characters, if present, in ISO 10646 as well as to entries in the International Glyph Registry, for which AFII is the registrar.

What is the Repertoire in a Set

A large number of the entities represent characters. Where presentation forms exist and it is desirable to be able to refer easily to a particular form, entities have been created for these forms. For example, up to five entities have been defined for each Arabic letter: one as a character, and four for use when one of the four presentation forms must be specified.

For entity sets representing scripts of scholarly interest, additional entities are included to enable the recording of variations that are important for research purposes. In such cases there is normally a "nominal" entity representing a character or syllable, which can be used to record texts where variations are not important. In addition, there are entities for each significant variation of a character or syllable, which may be used in those studies where variations are important to record. Thus, for example, if a character has two distinct presentation forms there would normally be three entities for it.
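
The counting rule in the two paragraphs above can be written as a small sketch. The function name entity_count is an assumption made for the example. The rule is the normal case described in the text, and actual sets may define fewer entities.

```python
def entity_count(presentation_forms):
    """Normal number of entities for one character or syllable.

    One "nominal" entity, plus one entity per distinct presentation
    form or significant variation.
    """
    return 1 + presentation_forms


print(entity_count(4))  # Arabic letter with four presentation forms: prints 5
print(entity_count(2))  # character with two distinct forms: prints 3
```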

Guidelines for the different parts of an entity definition

Entity name - in practice this is what is "standardized". The guidelines for the name are:
- it conforms to the SGML reference concrete syntax (which restricts the length of the name to 8 characters)
- the last two letters identify the script - if there is "more or less" a one-to-one correspondence between a script and a language, then any two-letter ISO code for the language is used
- the other (up to 6) characters are some meaningful contraction of the letter/syllable - either the name of the letter/syllable or how the sound(s) would be transliterated into English
Replacement text - in the canonical form just "[" || entity name || "]"
Comment in entity declaration - some meaningful description of the letter/syllable (so that someone could look at a font and pick out the right glyph - requirement of knowledge of the subject permitted). For political reasons this comment is selected from (in order of preference):
- ISO/IEC 10646 character name
- comment in the AFII glyph registry
- comment created specially for the entity set
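
The guidelines above can be sketched as a small checker and declaration builder. This is an illustration under stated assumptions, not a tool used for the proposals. The function names are hypothetical, the name "alephxx" and its script code "xx" are placeholders rather than a real entity or ISO code, and the name-character rule (a letter followed by letters, digits, "." or "-") is a simplified reading of the reference concrete syntax.

```python
import re

MAX_NAME_LENGTH = 8  # limit stated in the guidelines above


def check_entity_name(name):
    """Return a list of guideline violations for a proposed entity name."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 8 characters")
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9.\-]*", name):
        problems.append("not a valid name: letter, then letters, digits, '.' or '-'")
    # Up to 6 characters of contraction followed by a two-letter script code.
    if len(name) < 3 or not name[-2:].isalpha():
        problems.append("does not end in a contraction plus two-letter script code")
    return problems


def canonical_replacement_text(name):
    """Canonical form from the guidelines: "[" || entity name || "]"."""
    return "[" + name + "]"


def entity_declaration(name, comment):
    """Build an SDATA entity declaration with a trailing comment.

    The comment must not itself contain "--", which delimits SGML comments.
    """
    return '<!ENTITY {} SDATA "{}" -- {} -->'.format(
        name, canonical_replacement_text(name), comment
    )


# "alephxx" and its comment are hypothetical placeholders.
print(check_entity_name("alephxx"))
print(entity_declaration("alephxx", "hypothetical letter description"))
```

The checker only tests the form of the name. Whether the last two letters really identify the script, and whether the contraction is meaningful, still has to be judged by a reviewer.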

Proposals

Be warned that the Web page for each proposal contains a number of GIF images showing a typical glyph for each entity, so display may be slow.

The proposed entity sets will shortly also be available as a zip file containing a scanned TIFF image of each proposal.

Please send comments on the proposals to Anders Berglund; bcatf@ibm.net.