Statements on objects in SGML-based document design

In the following overview I do not go into the object-oriented model on which the document design is based. The 'system' is assumed to be built out of objects only, following a Smalltalk or Actor approach. For instance, EXPRESSIONs and BLOCKs are objects themselves, and can be created and saved like any other object.

In the design I have avoided constructs that defy some essential object-oriented principles:

Object identifiers: where possible, addressing strategies are mapped onto object identifiers.
Class objects: information shared by objects of the same type is stored in a class object. Classes are objects that store shared information in 'class attributes'. Instances (including classes) store local information in 'instance attributes'.
Type relations: when objects of two types behave similarly in all situations they are type related.
Classes versus types: in the TBMS types are implemented by classes.

The objects described here model the following characteristics of the structural model of SGML, as extracted from the standard. The structural model can be said to adhere to the element structure information set (ESIS, WG8, N1035), as described in the SGML Handbook by C. Goldfarb (1988).

Translators

Objects are created by evaluating expressions.
Objects may be created that translate expressions in some language into expressions understandable by the TBMS.
Examples of such objects are SGML (for SGML expressions), HYQ (for HyQ expressions) and PROGRAM (for program expressions).
SGML is a language for describing the structure of a document as it is exchanged. An abstract model of text is implied. This is the structural model of SGML (language structure). The actual form in which the abstract model is applied is the exchange model of SGML (language form).
Translation of SGML expressions into TBMS expressions is performed by an SGML object. The expressions created may be executed by the system, which results in a collection of objects.
The objects generated are subject to principles of subtyping and object aggregation / association.

[Figure: type relations in the TBMS]

The preceding image shows the type relations in the TBMS.

Nodes in a free tree

A DOCUMENT (or simply document) is an object that represents a tree of nodes. All objects making up the tree structure are a subtype of NODE.
Nodes may have a single parent node and any number of ordered child nodes. In all circumstances, node relations for tree structures are enforced.
Nodes are said to cover their descendants.
We may traverse a tree by going from one node to another, following node relations.
A path is any sequence of nodes within the same tree. Node traversal implies a path through the tree.
All paths known for any node constitute the context of the node. The parent, the children and the left and right siblings constitute the immediate context of the node.
As nodes can only access their immediate context, special objects should keep track of path traversal into some context rather than some immediate context. Such objects are called streams. Paths through a tree are accessed by NODESTREAMs, or simply node streams.
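
The node relations above can be sketched in Python. This is a minimal illustration, not the TBMS implementation: the names Node, append, siblings and node_stream are assumptions of mine, and the node stream is reduced to a depth-first generator.

```python
from typing import Iterator, List, Optional, Tuple


class Node:
    """Hypothetical NODE: at most one parent, any number of ordered children."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = []

    def append(self, child: "Node") -> None:
        # Enforce tree relations: no second parent and no cycles.
        if child.parent is not None:
            raise ValueError(f"{child.name} already has a parent")
        ancestor: Optional[Node] = self
        while ancestor is not None:
            if ancestor is child:
                raise ValueError("a node cannot cover itself")
            ancestor = ancestor.parent
        child.parent = self
        self.children.append(child)

    def siblings(self) -> Tuple[Optional["Node"], Optional["Node"]]:
        """Return the (left, right) siblings; either may be None."""
        if self.parent is None:
            return None, None
        row = self.parent.children
        i = next(k for k, n in enumerate(row) if n is self)
        left = row[i - 1] if i > 0 else None
        right = row[i + 1] if i + 1 < len(row) else None
        return left, right


def node_stream(start: Node) -> Iterator[Node]:
    """Hypothetical NODESTREAM: one depth-first path through start's subtree."""
    yield start
    for child in start.children:
        yield from node_stream(child)


root = Node("doc")
a, b = Node("a"), Node("b")
root.append(a)
root.append(b)
a.append(Node("a1"))
path = [n.name for n in node_stream(root)]  # nodes in traversal order
```

Each node only looks at its parent, children and siblings. The traversal state lives in the stream, which is the division of labour the statements above describe.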

Units in document content

The nodes make up the document's content. Such objects may be elements (type ELEMENT), data portions (type CDATA), or objects that represent other objects (type NOTATION). These UNITs (or simply units) behave like nodes in all circumstances. Therefore, UNIT is a subtype of NODE.
ELEMENT units (or simply elements) may record 0..n ordered child units, and are by nature non-terminal units.
CDATA units (or simply cdata units), which represent character data portions, cannot have child units. They are terminal units.
NOTATIONs (or simply notations) are units that represent some non-unit embedded in element content. Notations are terminal units.
DOCUMENT is not a unit, but rather an object that records information shared by the units in content.
Nodes should not be equated with units; they have an independent status in the type hierarchy. This is because node behavior also applies to model units, i.e. objects representing the content model of some element.
The UNIT type is abstract: the type outlines expected behavior of all types of nodes in the document tree.
The UNIT is a way to collect all structural and behavioral features of objects that may occur in a document. All units render services that are based on the fact that they are nodes that are associated with a text portion, have a status, and may predict units in context.

[Figure: units making up a document tree]

The preceding image shows how units make up a document tree: the document on the left, the units in the middle, and the associated text portions on the right.
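
The type relations between NODE, UNIT and the three kinds of unit can be sketched as Python classes. The class and method names are assumptions. Element is left instantiable here for brevity, although ELEMENT is itself treated as an abstract type further on.

```python
from abc import ABC, abstractmethod
from typing import List


class Node:
    """Stand-in for NODE: anything that can sit in a tree."""

    def __init__(self) -> None:
        self.children: List["Node"] = []


class Unit(Node, ABC):
    """Abstract UNIT: a node that occurs in document content."""

    @property
    @abstractmethod
    def terminal(self) -> bool:
        """True when the unit cannot have child units."""

    def add_child(self, child: "Unit") -> None:
        if self.terminal:
            raise TypeError(f"{type(self).__name__} is a terminal unit")
        self.children.append(child)


class Element(Unit):
    terminal = False   # non-terminal: records 0..n ordered child units


class CData(Unit):
    terminal = True    # character data portion


class Notation(Unit):
    terminal = True    # represents a non-unit embedded in content


para = Element()
para.add_child(CData())
para.add_child(Notation())
```

Because Unit declares terminal as abstract, Unit() itself cannot be created. Only the subtypes that fix the value can.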

Special markers

Units have a status: units may be manifest, hidden, and/or temporary.
A unit may be marked as included or ignored. This mark is called the unit status. When a unit is a descendant of any ignored unit, it is 'hidden'. In all other cases it is 'manifest'.
The status is recorded symbolically (by name) by the document the unit is part of. Units thus record the name of a status value, which itself is recorded by the document.
The status may restrict access to complete subtrees or text portions ('hiding' sections of the document).
A unit that is ignored does not play a role in prediction.
A unit may also be marked as temporary. In this case the unit maintains its status but may signal its temporary nature when required. Each unit is temporary when its parent is temporary, or when it is marked as temporary itself.
A unit type may be breaking or continuous. When some type of unit is breaking, instances of that type do not allow searches for data to cross the text they cover. All data units are continuous. All units representing external objects are breaking. Element subtypes may be breaking or continuous. This is a characteristic of a unit's class.
All units are associated with some character data sequence, loosely called the text. The text is represented by a TEXT object. Updating and searching the text is a behavioral feature of all units. In rendering these services the units cooperate with the TEXT instance.
A unit is said not only to cover its sub-units, but also the text it is associated with, and all possible sub-texts. Subsequently, a unit completely covers the text its sub-units cover.
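
A minimal sketch of how the status could be resolved. The names Document, Unit, status_table and the is_* methods are assumptions. The 'hidden' rule is implemented literally: only the ancestors of a unit are inspected, not the unit itself. The temporary mark is modelled as a separate flag.

```python
from typing import Dict, List, Optional


class Document:
    def __init__(self) -> None:
        # Status name -> 'include' or 'ignore'; units store only the name.
        self.status_table: Dict[str, str] = {}


class Unit:
    def __init__(self, document: Document,
                 status_name: Optional[str] = None,
                 temporary: bool = False) -> None:
        self.document = document
        self.status_name = status_name
        self.marked_temporary = temporary
        self.parent: Optional["Unit"] = None
        self.children: List["Unit"] = []

    def add(self, child: "Unit") -> "Unit":
        child.parent = self
        self.children.append(child)
        return child

    def is_ignored(self) -> bool:
        if self.status_name is None:
            return False
        return self.document.status_table.get(self.status_name) == "ignore"

    def is_hidden(self) -> bool:
        # Hidden: descendant of any ignored unit.
        node = self.parent
        while node is not None:
            if node.is_ignored():
                return True
            node = node.parent
        return False

    def is_manifest(self) -> bool:
        return not self.is_hidden()

    def is_temporary(self) -> bool:
        # Temporary: marked itself, or any ancestor is temporary.
        node: Optional[Unit] = self
        while node is not None:
            if node.marked_temporary:
                return True
            node = node.parent
        return False


doc = Document()
doc.status_table["draft"] = "ignore"
section = Unit(doc, status_name="draft")
para = section.add(Unit(doc))
note = para.add(Unit(doc, temporary=True))
```

Because units record only the status name, changing the single table entry for "draft" to 'include' makes the whole subtree manifest again without touching any unit.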

Elements

ELEMENTs collect other units according to a predeclared model of content, maintain a set of SGML attributes, are identified by name by a document instance, and so on. They are the fundamental objects of structuring.
The ELEMENT type is abstract: one can create a 'title element', but not an 'element'. TITLE-ELEMENT is a kind of ELEMENT (subtype).
SGML general identifiers for elements are mapped onto element types.
Elements are typed in the context of a document: a 'line' in a document of some type is assumed to be different from a 'line' in another document type.
Elements of the same type within a single document are similar, and should all be implemented by the same class.
Specific element types may show specialized behavior, expected from the document substructure they cover.
A document records a single document element, which is the root of the tree.
Elements not only cover units, they may also reference objects.
References to objects are valid only within the framework of the document the referencing units belong to. Therefore, references are symbolical (by giving some object a name). Reference names are associated with referenced objects by the document.
Elements may reference elements that are part of the same document (cross-referencing). Elements cannot reference elements in other documents.
Elements may reference objects that are not part of the document (external objects).
The set of names of external objects differs from the set of element names. Therefore a name may refer to both an external object and an element in the document.

SGML attributes

Information related to elements that is not expressed in content is stored in SGML attributes.
The set of possible SGML attributes of an element is the same for all elements of the same type.
The description of the set of SGML attributes is recorded by the class object implementing that type of element.
The set of actual SGML values for all SGML attributes is recorded by each element of the type.
SGML attributes may be used to streamline the behavior of that specific element type.
Some SGML attribute values must be set, some may be implied when not explicitly set, some SGML attributes are fixed to a value, some are extracted from attribute values of other elements in the same document, some are extracted from the content of the element.
Some SGML attribute definitions are mutually exclusive.
SGML attributes differ from object attributes because the impliable value of SGML attributes must sometimes be inferred from other units in context (current value, content reference attribute).
An element is attribute incomplete when an attribute is required but not set, or cannot be implied. In other cases the element is attribute complete.
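
The split between attribute definitions (held by the class object) and attribute values (held by each element) can be sketched as follows. AttDef, ElementClass, Element and the three kinds 'required', 'default' and 'current' are assumptions made for illustration. The 'current' rule is simplified to "the value most recently set on any element of the type", ignoring document order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AttDef:
    name: str
    kind: str                      # 'required', 'default' or 'current'
    default: Optional[str] = None  # used by kind 'default' only


class ElementClass:
    """Class object: records the attribute definitions for one element type."""

    def __init__(self, gi: str, attdefs: List[AttDef]) -> None:
        self.gi = gi
        self.attdefs: Dict[str, AttDef] = {d.name: d for d in attdefs}
        self.current: Dict[str, str] = {}  # last value set per 'current' attribute


class Element:
    """Instance: records the actual attribute values."""

    def __init__(self, cls: ElementClass) -> None:
        self.cls = cls
        self.values: Dict[str, str] = {}

    def set(self, name: str, value: str) -> None:
        definition = self.cls.attdefs[name]  # KeyError if undeclared
        self.values[name] = value
        if definition.kind == "current":
            self.cls.current[name] = value

    def value(self, name: str) -> Optional[str]:
        if name in self.values:
            return self.values[name]
        definition = self.cls.attdefs[name]
        if definition.kind == "default":
            return definition.default
        if definition.kind == "current":
            return self.cls.current.get(name)  # None: cannot be implied
        return None                            # required but not set

    def attribute_complete(self) -> bool:
        return all(self.value(n) is not None for n in self.cls.attdefs)


para = ElementClass("para", [
    AttDef("id", "required"),
    AttDef("lang", "current"),
    AttDef("align", "default", "left"),
])
p1 = Element(para)
p1.set("id", "p1")
p1.set("lang", "en")
p2 = Element(para)
p2.set("id", "p2")   # 'lang' is implied from p1, 'align' from the default
```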

Content model

To know what type of units may occur in the unit's immediate context we require the unit to predict such unit types. Prediction sets therefore give guidance in document updates.
The type of a left sibling unit, right sibling unit and first child unit can be predicted.
The types of units that may occur at some location in the document are represented in a content model. All units of the same type have the same content model. All units that occur in content therefore must relate to this content model for the containing unit to be conforming to the model.
The model of a unit is a tree of MODEL-UNITs (or simply model units). Model units are nodes, as they are arranged in a strict hierarchy of other model units.
A model unit may denote declared or modelled content.
Content may be declared to contain character data only (DATA-CONTENT), be empty in all cases (EMPTY-CONTENT), or contain any kind of element known to the document (ANY-CONTENT).
If content is modelled the content model is represented by a CTOKEN instance (for content token).
There are 6 model units for model groups: AND-GROUP, OR-GROUP, SEQ-GROUP, OPT-GROUP, REP-GROUP, PLUS-GROUP. These are subtypes of MGROUP.
There are 2 model units representing primitive content tokens: ETOKEN, for element token, and PCDATA, for parsed character data. These are subtypes of PTOKEN.
The terminal nodes of a MODEL-CONTENT are always PTOKENs.
The model may be accompanied by a set of inclusions, i.e. ETOKENs describing elements that may occur as any descendant unit.
The model may be accompanied by a set of exclusions, i.e. ETOKENs describing elements that may not occur as a descendant.
When a unit of a type thus predicted is inserted, it links to an origin. The origin is the model unit that allowed the unit to appear as a child unit. Units may originate from some primitive content token in modelled content, declared content or an inclusion set. The recorded origin must therefore be an ETOKEN, PCDATA, DATA-CONTENT, ANY-CONTENT or none.
Units that do not relate to the content model are accepted on a temporary basis. These units are called visitors; they have no origin.
Only units that originate from declared or modelled content can predict what units may appear on their left or right.
Origins can only be assigned when the content model is valid. A content model is valid when it is as simple as possible, and when it is not ambiguous. On the basis of rules for simplification and disambiguation it can be decided if the model is overly complex or ambiguous. These rules are implemented by model groups (MGROUPs).
A unit's content is model incomplete when some unit in content has no origin. In all other cases the unit is model complete.
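
Prediction of the first child unit can be sketched as a 'first set' computation over the tree of model units. This is an illustration under assumptions: ModelUnit, etoken, group, nullable and first are hypothetical names, OPT-GROUP, REP-GROUP and PLUS-GROUP are taken to have exactly one child, and inclusions and exclusions are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ModelUnit:
    kind: str                    # e.g. "SEQ-GROUP", "ETOKEN", "PCDATA"
    name: Optional[str] = None   # element type name, for ETOKEN only
    children: List["ModelUnit"] = field(default_factory=list)


def etoken(name: str) -> ModelUnit:
    return ModelUnit("ETOKEN", name)


def group(kind: str, *children: ModelUnit) -> ModelUnit:
    return ModelUnit(kind, children=list(children))


def nullable(m: ModelUnit) -> bool:
    """True when the model unit can be satisfied by no content at all."""
    if m.kind == "ETOKEN":
        return False
    if m.kind in ("PCDATA", "OPT-GROUP", "REP-GROUP"):
        return True
    if m.kind == "PLUS-GROUP":
        return nullable(m.children[0])
    if m.kind == "OR-GROUP":
        return any(nullable(c) for c in m.children)
    return all(nullable(c) for c in m.children)   # SEQ-GROUP, AND-GROUP


def first(m: ModelUnit) -> Set[str]:
    """Unit types that may occur as the first child under this model."""
    if m.kind == "ETOKEN":
        return {m.name}
    if m.kind == "PCDATA":
        return {"#PCDATA"}
    result: Set[str] = set()
    for child in m.children:
        result |= first(child)
        if m.kind == "SEQ-GROUP" and not nullable(child):
            break   # later members of the sequence cannot come first
    return result


@dataclass
class Unit:
    type_name: str
    origin: Optional[ModelUnit] = None   # None marks a visitor


def model_complete(content: List[Unit]) -> bool:
    return all(u.origin is not None for u in content)


# (title?, (para | list)+)
title, para, lst = etoken("title"), etoken("para"), etoken("list")
model = group("SEQ-GROUP",
              group("OPT-GROUP", title),
              group("PLUS-GROUP", group("OR-GROUP", para, lst)))
predicted = first(model)
content = [Unit("title", origin=title), Unit("figure")]  # 'figure' is a visitor
```

Here predicted contains title, para and list, and the content is model incomplete because the figure unit has no origin.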

Character data

Unstructured text sections covered by an element are represented by CDATA (or cdata, for character data) units.
A cdata unit knows how to act when some unit is inserted in the character sequence it covers.
A cdata unit occurs where, at that location, character data occurs or could occur. Cdata units therefore appear where the content model of the parent element allows character data to appear.
Elements that are declared to be empty in all cases, and elements that have element content, do not have a cdata child unit. All other elements have at least one cdata child unit.
No two cdata units can be adjacent unless they have a recorded status.
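
One way to maintain the adjacency rule is to merge neighbouring cdata units after an update. The sketch below assumes that merging is the intended remedy and that the exception applies as soon as either unit has a recorded status. Both are my assumptions, as are the names Unit and merge_adjacent_cdata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    kind: str                          # "cdata" or "element"
    text: str = ""
    status_name: Optional[str] = None  # recorded status, if any


def merge_adjacent_cdata(children: List[Unit]) -> List[Unit]:
    """Return the child sequence with status-free cdata neighbours merged."""
    merged: List[Unit] = []
    for unit in children:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.kind == "cdata" and unit.kind == "cdata"
                and prev.status_name is None and unit.status_name is None):
            merged[-1] = Unit("cdata", prev.text + unit.text)
        else:
            merged.append(unit)
    return merged


children = [
    Unit("cdata", "ab"),
    Unit("cdata", "cd"),
    Unit("element"),
    Unit("cdata", "x"),
    Unit("cdata", "y", status_name="draft"),
]
normalized = merge_adjacent_cdata(children)
```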

Notation

NOTATIONs represent external objects placed in element content. They are a vehicle for referencing such an external object within the document tree.
The external object may be any object that can exist within the system. SGML defines exactly one such object: the SGML document. All other external objects (images, pieces of sound, files) are outside the scope of an SGML system.
Notations originate in DATA-CONTENT or PCDATA.

Document

A DOCUMENT is an object that holds all information shared by all units within the document tree.
The document holds the root of the tree of units, a link with the TEXT instance managing the text of the document, and tables for recording the names of units in the document tree, names of external objects, and status names.
Element names (SGML ID values) are associated with element instances within the same document.
External object names (SGML ENTITY values) are associated with external objects. External objects cannot be part of any document tree.
Status names are associated with the values 'include', 'ignore' or 'temp'.
The links with a text instance and the root of the document tree are immutable.
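
A sketch of the tables a document holds. The class layout and method names are assumptions, and plain strings stand in for the root element, the TEXT instance and the named objects. The immutable links are modelled as read-only properties.

```python
from typing import Any, Dict


class Document:
    """Hypothetical DOCUMENT: information shared by all units in one tree."""

    def __init__(self, root: Any, text: Any) -> None:
        self._root = root
        self._text = text
        self.element_names: Dict[str, Any] = {}   # SGML ID value -> element
        self.external_names: Dict[str, Any] = {}  # SGML ENTITY value -> object
        self.status_names: Dict[str, str] = {}    # status name -> value

    @property
    def root(self) -> Any:
        return self._root   # no setter: the link cannot be replaced

    @property
    def text(self) -> Any:
        return self._text   # no setter: the link cannot be replaced

    def name_status(self, name: str, value: str) -> None:
        if value not in ("include", "ignore", "temp"):
            raise ValueError(f"unknown status value: {value}")
        self.status_names[name] = value


doc = Document(root="<document element>", text="<TEXT instance>")
# The two name sets are separate, so one name may occur in both.
doc.element_names["fig1"] = "<figure element>"
doc.external_names["fig1"] = "<external image object>"
doc.name_status("draft", "ignore")
```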

Text

Documents implement a view on a sequence of data characters, called the text.
The text is managed by a TEXT instance. This instance renders services for accessing and updating the text.
The text instance communicates with units by means of pseudo-data objects, strings and patterns passed back and forth.
The complete text managed by the text instance is covered completely by the document element. Any unit covers a part of the text covered by the parent element (a sub-unit covers a sub-text). All child units in sequence cover all text covered by the parent unit.
If a unit is null it covers 0 characters.
If an element is empty it has no descendants.
If an element is empty, it is automatically null.
If an element is not empty, it may still be null. In that case its descendants are all null, or are notations.
Notations are null units: they have a precise location in the text but always have a character size of 0.
The sub-text covered by a unit is identified by a region. A region is the start position of that sub-text, and the length of the sub-text.
A relative region is the start and length of the sub-text relative to the sub-text covered by the parent unit.
When a region is communicated with some other object a PDATA instance (pseudo-data, or simply pdata) is passed. A pdata is a relative region within some unit, which is called the base of the pdata.
External objects placed in content, represented by NOTATIONs, are unrelated to the document structure, and therefore the text they possibly cover is not part of the text of the document the notation is part of.
Changes in the text are the result of an operation required by a unit that is part of a document that covers that text.
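
Regions, relative regions and pdata can be sketched as simple arithmetic on start positions. Unit, PData and their field names are assumptions. A region is returned as a (start, length) pair counted from the beginning of the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Unit:
    rel_start: int                   # start relative to the parent's sub-text
    length: int
    parent: Optional["Unit"] = None

    def region(self) -> Tuple[int, int]:
        """Absolute region: sum the relative starts up to the root."""
        start = self.rel_start
        node = self.parent
        while node is not None:
            start += node.rel_start
            node = node.parent
        return start, self.length


@dataclass
class PData:
    base: Unit       # the unit the region is relative to
    rel_start: int
    length: int

    def region(self) -> Tuple[int, int]:
        base_start, base_length = self.base.region()
        if (self.rel_start < 0 or self.length < 0
                or self.rel_start + self.length > base_length):
            raise ValueError("pdata does not fit inside its base")
        return base_start + self.rel_start, self.length


text = "Title. Body text."
document_element = Unit(0, len(text))
title = Unit(0, 7, parent=document_element)   # "Title. "
body = Unit(7, 10, parent=document_element)   # "Body text."
word = PData(base=body, rel_start=5, length=4)
start, length = word.region()
covered = text[start:start + length]
```

With these values the pdata resolves to the word "text" inside the body unit. The null notations fit the same scheme as units of length 0.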

Concurrency

Several documents can cover precisely the same text in all circumstances. In that case the documents are concurrent.
The text instance maintains a link with all document elements of all concurrent documents implementing a view on that text.
There is no connection between two concurrent documents other than through the text they both cover.
The text itself does not reflect the structure of any document that covers that text. Text and structure(s) are completely separated.
Because all concurrent documents must cover the same text, changes in the text are reflected in all concurrent documents. Such alignments are expressed in update notifications passed from the text to the concurrent documents.
Documents may reorder the units they contain as a result of an update notification.
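
The update notifications can be sketched as a plain observer arrangement. Text, Document, attach and text_updated are hypothetical names. Each document is reduced to the length it covers plus a log of the notifications it received, so the realignment of units is not shown.

```python
from typing import List, Tuple


class Text:
    """Hypothetical TEXT: owns the characters and notifies documents."""

    def __init__(self, data: str = "") -> None:
        self.data = data
        self._documents: List["Document"] = []

    def attach(self, document: "Document") -> None:
        self._documents.append(document)

    def insert(self, position: int, string: str) -> None:
        if not 0 <= position <= len(self.data):
            raise IndexError("position outside the text")
        self.data = self.data[:position] + string + self.data[position:]
        # Every concurrent document hears about the change.
        for document in self._documents:
            document.text_updated(position, len(string))


class Document:
    def __init__(self, name: str, text: Text) -> None:
        self.name = name
        self.text = text
        self.covered = len(text.data)   # length covered by the document element
        self.notifications: List[Tuple[int, int]] = []
        text.attach(self)

    def insert(self, position: int, string: str) -> None:
        # A change is requested by one document; the text carries it out.
        self.text.insert(position, string)

    def text_updated(self, position: int, inserted: int) -> None:
        self.covered += inserted
        self.notifications.append((position, inserted))


text = Text("Hello world")
logical = Document("logical", text)
layout = Document("layout", text)
logical.insert(5, ",")
```

The two documents never refer to each other. The layout document learns of the change made through the logical document only because both are attached to the same text.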