[Archive copy mirrored from: http://www.hightext.com/IHC96/ek8.htm, text only - and the graphics tell a lot!]
W. Eliot Kimber, Senior SGML Consultant and HyTime Specialist
Passage Systems Inc.
Presented at HyTime '96, Seattle, WA, August 20 and 21, 1996.
Any data processing application requires some definition of the data structures and data objects to be processed. This definition can be more or less formal depending on the needs of the application, how widely it will be used, what other systems will interact with it, and so on. The more general and complex the application, the greater the need for formal definitions of data structures.
These definitions provide a clear and, to the greatest degree possible, unambiguous description of object and data types, the properties of objects, and the possible relationships between objects. Given such formal definitions, implementors and users can more easily understand and work with objects. Implementors can implement the support for objects more consistently. Agreements between groups of users and implementors can be captured as formal object definitions, making it clear what the agreement was.
The objects HyTime works with are the structures within SGML documents and other kinds of data: SGML elements, parsed data strings, and identifiable structures in other data notations, as well as some of the transient side effects of hypermedia processing, such as location ladders, hyperlinks, and the results of projecting and rendering events in event schedules. These tend to be complex data objects that can have many different equally useful and correct representations as objects.
In the case of SGML documents, there are many different data elements that result from SGML parsing. The SGML standard defines the source syntax and parsing rules for SGML documents, but it doesn't define how the result of applying those parsing rules should be represented or even, necessarily, what the precise result is. However, HyTime works with the parsed result, not the original source itself (remember that the location source for any HyTime location address is ultimately a node in a grove). Because HyTime is an enabling application architecture intended to support the interoperation of a wide variety of different engines, processors, and applications, there must be agreement on how the objects HyTime works with look.
In addition, there must be a formal mechanism for describing the properties semantic location addresses work with. In other words, before you can have a property location address or a query, you must define what the properties are.
Of course, the definition of objects and their properties is of utility to a wide variety of applications. In particular, the DSSSL standard works with much the same set of objects and properties that HyTime does, as they both work from SGML as a starting point. Therefore, the SGML Extended Facilities annex includes constructs for declaring and describing objects and their properties. The DSSSL and HyTime standards then use these facilities for the definition of objects resulting from parsed SGML documents and objects unique to each standard. Other standards and applications can use these facilities as well.
Because these formal object and property definitions are primarily for the purpose of documenting application and architecture designs, it is not necessary to understand the syntactic details of property sets unless you are implementing HyTime (or DSSSL) processors or want to use this formalism in your own application or architecture designs.
The following sections explain property sets and the groves that result from parsing or processing data according to those property sets. The property set and grove formalisms provide the fundamental conceptual underpinnings for HyTime, DSSSL, and, potentially, any other SGML architecture or processing application.
In the discussion of groves under section DEFAULT-HYTIME-VIEW, you were introduced to the concept of objects and properties that result from parsing data into the internal data structures that processing applications work with. The objects that make up a grove are defined in a property set, which defines a set of objects and their properties. A property set can also define data types, such as "string" or "integer", and lexical types. The data types and lexical types are then used in the definition of properties.
Property sets serve primarily as documentation for a system and need not be processed by a HyTime engine (although they can be). While any application that uses general facility property sets should include them in its documentation, it need not provide them as part of the executable program. Presumably the objects will be embodied in the program itself. It may be possible, depending on the nature of the objects and the application, to automatically derive program objects or data schemas from general facility property sets, but they were not particularly designed to enable such processing. [Footnote 8. TechnoTeacher, Inc. has built a general property set processor, PropMinder, that processes Extended Facilities property sets and generates schemas for different object-oriented databases. It can also generate skeletal object-oriented program code that can then be completed by a programmer.]
From a property set, you then define one or more grove plans, which describe how the objects derived from the processing of a particular data content notation are built into groves. A grove plan associates a property set with a parsing context. The grove plan also indicates what object types should or should not be included in the grove (for example, in an SGML grove, you may not need all the objects and properties related to the original markup string). For every property set there is an implicit maximal grove plan that includes all objects and properties.
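The relationship between a property set and a grove plan can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names PropertySet, GrovePlan, and maximal_plan, and the class names "element" and "markup", are hypothetical and do not come from the standard, which defines grove plans declaratively rather than as program objects.

```python
from dataclasses import dataclass


@dataclass
class PropertySet:
    """Hypothetical stand-in: maps each object class name to its property names."""
    classes: dict


@dataclass
class GrovePlan:
    """Hypothetical grove plan: a property set, a parsing context, and omissions."""
    property_set: PropertySet
    parsing_context: str
    omitted_classes: frozenset = frozenset()

    def included_classes(self):
        # Keep every class of the property set that the plan does not omit.
        return {
            name: props
            for name, props in self.property_set.classes.items()
            if name not in self.omitted_classes
        }


def maximal_plan(property_set, parsing_context):
    """The implicit maximal plan omits nothing."""
    return GrovePlan(property_set, parsing_context)
```

A plan that omits the assumed "markup" class would yield a grove containing only the remaining classes, while the maximal plan yields all of them.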
Property sets are defined in property set definition documents, SGML documents conforming to the property set definition document type defined in the PSDR annex. The HyTime and DSSSL standards both include the SGML property set definition document. The HyTime standard also includes the HyTime property set definition document.
All processing systems must build in memory their own representations of the data they operate on. In practice these representations may take many different forms: arrays, variable pools, relational tables in a database, objects in an object-oriented database, and so on. For ad-hoc processors or processors acting on single-use or proprietary data formats, it is not usually necessary to formalize these data structures beyond whatever is needed to support the development and maintenance of the program itself. However, SGML and its related standards exist in part to enable the interoperation of a wide variety of tools and systems. HyTime, in particular, exists to enable such interoperation among the tools and systems found in a distributed, networked, open hypermedia environment.
Thus these standards need a general, system-independent formalism for defining and referring to "in-memory" data structures. Property sets are the first part of this definition: they provide a language for defining object types and properties and the allowed relationships among them. Groves are the other part of this formalism: they define how the objects defined in property sets are represented during processing and how different groves can be related together. In particular, groves define a regular and predictable structure for data so that it can be addressed reliably given knowledge of the property set used to produce the grove. Two processors working on the same data with the same property set and grove plan should produce identical groves. Other programs, communicating with these programs in terms of the grove, should get the same results from both programs.
Groves provide an abstract data and processing model. Actual programs need not implement groves literally in their own data structures as long as they can provide the correct results. It can be useful to think of programs providing a "grove view" of their internal data structures. Grove views make it possible for programs to communicate with each other by using the common language of groves and property sets, regardless of what their own internal data representations are. This approach is similar to application standards such as ODBC and CORBA, which define common object models and APIs for specific types of applications. The difference is that groves and property sets are, like SGML, a meta-mechanism for defining common object models, not the definition of a common object model directly.
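To make the "grove view" idea concrete, here is a minimal Python sketch of a program that stores its data as flat table rows yet answers grove-style questions about them. The class TableBackedGroveView, its method names, and the row layout are all assumptions invented for this illustration; nothing here is an API defined by HyTime or DSSSL.

```python
class TableBackedGroveView:
    """Hypothetical grove view over rows of (row_id, parent_id, label)."""

    def __init__(self, rows):
        # Internal representation: a plain table keyed by row id.
        self._rows = {row_id: (parent_id, label) for row_id, parent_id, label in rows}

    def root(self):
        # The grove root is the one node that has no origin.
        roots = [rid for rid, (parent, _) in self._rows.items() if parent is None]
        if len(roots) != 1:
            raise ValueError("a grove has exactly one root")
        return roots[0]

    def origin(self, node_id):
        return self._rows[node_id][0]

    def label(self, node_id):
        return self._rows[node_id][1]

    def subnodes(self, node_id):
        # Rows keep their insertion order, so the result is an ordered list.
        return [rid for rid, (parent, _) in self._rows.items() if parent == node_id]
```

A client that only calls root, origin, and subnodes never needs to know that the data lives in a table rather than in linked node objects.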
A grove is formally defined as a directed graph of nodes. Nodes are ordered sets of property assignments. A grove has exactly one grove root, which is that node within a grove that has no origin property. Groves may be related together to form hypergroves.
Each node in a grove exhibits one or more properties, as defined in the property set used to construct the grove. Properties are either nodal or non-nodal. Nodal properties consist of single nodes or node lists. Non-nodal properties contain atomic data values, such as integers, strings, and Boolean values. There are three types of nodal properties: node-valued, node lists, and named node lists.
Node-valued properties are properties whose values are always a single node. Node lists are properties whose values are lists of nodes. Named node lists are node lists where each node has a name that is unique within the node list. Node lists may consist of zero or more nodes. The nodes in node lists are always ordered so that they can be addressed by position within the list.
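A named node list, as described above, combines two kinds of access: by position and by unique name. The following Python sketch shows one way to model that; the class NamedNodeList and its method names are hypothetical, and nodes are represented as plain dictionaries purely for brevity.

```python
class NamedNodeList:
    """Hypothetical ordered node list in which every node has a unique name."""

    def __init__(self):
        self._names = []   # preserves list order for positional addressing
        self._nodes = {}   # name -> node

    def append(self, name, node):
        if name in self._nodes:
            raise ValueError(f"name {name!r} is already used in this node list")
        self._names.append(name)
        self._nodes[name] = node

    def by_position(self, index):
        return self._nodes[self._names[index]]

    def by_name(self, name):
        return self._nodes[name]

    def __len__(self):
        return len(self._names)
```

An empty NamedNodeList is valid, which matches the rule that node lists may consist of zero or more nodes.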
Nodes can be associated with a particular node list by one of three possible relationship types: subnode, iref (internal reference), or uref (unrestricted reference). Nodes with a relationship type of "subnode" are directly contained by the node list. The node that exhibits the node list property is said to be the origin of the nodes in the node list. Each node has exactly one origin (except for the grove root, which has no origin). This means that a node occurs in exactly one node list as a subnode. (Technically, the subnode relationships in a grove define an acyclic directed graph.)
Nodes in a node list with a relationship type of "iref" are in the same grove as the node that exhibits the node list property, but have a different node as their origin. Iref relationships represent things like SGML ID references. Nodes in a node list with a relationship type of "uref" may be in the same or different groves. Uref relationships between groves create hypergroves. Uref relationships represent things like HyTime cross-document addresses.
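The three relationship types and the single-origin rule can be sketched as follows. This is a simplified Python model, not the formal definition: the Node class, the add_node_list method, and the grove_root helper are assumed names, and the check for iref only verifies that both nodes share a grove root.

```python
class Node:
    """Hypothetical grove node holding node-list properties."""

    def __init__(self, label):
        self.label = label
        self.origin = None      # set when the node becomes a subnode
        self.properties = {}    # property name -> (relationship, [nodes])

    def add_node_list(self, name, relationship, nodes):
        nodes = list(nodes)
        if relationship == "subnode":
            # Each node has exactly one origin, so it may be a subnode only once.
            if any(n.origin is not None for n in nodes):
                raise ValueError("a node occurs in exactly one node list as a subnode")
            for n in nodes:
                n.origin = self
        elif relationship == "iref":
            # Internal references must stay within the same grove.
            if any(grove_root(n) is not grove_root(self) for n in nodes):
                raise ValueError("iref targets must be in the same grove")
        elif relationship != "uref":
            raise ValueError(f"unknown relationship type {relationship!r}")
        self.properties[name] = (relationship, nodes)


def grove_root(node):
    """Follow origin links upward until reaching the node with no origin."""
    while node.origin is not None:
        node = node.origin
    return node
```

A uref to a node whose grove root differs from the referring node's root is what ties two groves together into a hypergrove in this model.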
Figure 1 shows a typical node in a grove:
Figure 1. Typical Node in a Grove
Each node is nothing more than a collection of properties and their values.
Groves are said to be "constructed" by grove constructors according to a grove plan. Grove constructors are processors that take as input either the result of parsing data or another grove or groves and produce a new grove. For SGML, the input to a grove constructor would be the output of an SGML parser and the output would be an SGML document grove as defined by the SGML property set. For HyTime, the input to a grove constructor would be one or more SGML document groves and the output would be a HyTime semantic grove as defined by the HyTime property set. In a real system, parsers and grove constructors may be bound together. For other types of processors, such as HyTime engines, grove construction is simply part of what they do. Figure 2 shows the construction of an SGML document grove.
Figure 2. Construction of an SGML Document Grove
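As a rough illustration of a grove constructor working from parser output, the Python sketch below turns a stream of parse events into nested nodes. The event format ("start", "data", "end" tuples), the function name construct_grove, and the dictionary keys are assumptions made for this example; a real SGML document grove contains many more classes and properties than shown here.

```python
def construct_grove(events):
    """Build a nested node structure from hypothetical (kind, value) parse events."""
    root = None
    stack = []  # currently open element nodes, innermost last
    for kind, value in events:
        if kind == "start":
            node = {"class": "element", "gi": value, "content": []}
            if stack:
                stack[-1]["content"].append(node)
            elif root is None:
                root = node
            else:
                raise ValueError("more than one document element")
            stack.append(node)
        elif kind == "data":
            if not stack:
                raise ValueError("data outside of any element")
            stack[-1]["content"].append({"class": "data", "text": value})
        elif kind == "end":
            if not stack or stack[-1]["gi"] != value:
                raise ValueError(f"unexpected end tag for {value!r}")
            stack.pop()
        else:
            raise ValueError(f"unknown event kind {kind!r}")
    if root is None or stack:
        raise ValueError("incomplete document")
    return root
```

Because the function depends only on the event stream, the parser and the constructor can be separate components or bound together, as the text notes.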
The grove abstraction implies a simple processing model revolving around the creation and interconnection of groves. Regardless of what real systems actually do, it is useful to model the processing of SGML documents as the creation of groves in order to define the processing needed for a particular task without regard to implementation details. Once a satisfactory grove-based model has been defined, it can be translated into specific implementation designs where optimizations and shortcuts can be applied.
In an SGML world, processing always begins by parsing an SGML document. The result of this parsing is called a parse grove or pGrove, the grove that results from parsing. A pGrove will then be processed to produce other groves or some non-grove output. When architectures are being used, a pGrove can be processed by a generic architecture engine to produce a grove representing the architectural instance derived from the base document, the architectural grove or aGrove. For example, a HyTime document would be parsed into a pGrove. The pGrove would then be processed to derive the document's HyTime aGrove, as shown in Figure 3. Groves created from other groves that have (or can have) an independent existence are said to be derived groves. Derived groves that are created for specific processing purposes and are not independent of the groves from which they are created are called auxiliary groves. For example, the grove that results from applying a data tokenizer to the content of an SGML element is an auxiliary grove, whereas architectural groves are derived groves.
Figure 3. Creating a HyTime Architectural Grove from a pGrove
Because architectural groves are inherent in client documents, it is useful to assume that there is always an aGrove present, whether or not actual processing systems are implemented that way. Architecture-specific processors are then assumed to take as their initial input the aGrove for their architecture, rather than the client documents themselves.
A processor can always get from an aGrove to the pGrove from which it was derived because each node in a derived grove has the intrinsic property "source", which is the node or nodes from which it was derived. For example, in a HyTime architectural grove, each element node would have as its source the node in the client pGrove from which it was derived, as shown in Figure 4.
Figure 4. Nodes in an aGrove Derived From Nodes in a pGrove
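The "source" property lends itself to a short sketch. In the Python below, GroveNode, derive, and originals are hypothetical names, and the labels used in any example are invented; the only idea taken from the text is that each derived node records the node or nodes it was derived from, so a processor can walk back to the pGrove.

```python
from dataclasses import dataclass, field


@dataclass
class GroveNode:
    """Hypothetical node with a label and the intrinsic 'source' property."""
    label: str
    source: list = field(default_factory=list)


def derive(source_node, label):
    """Create a derived node that remembers the node it came from."""
    return GroveNode(label, source=[source_node])


def originals(node):
    """Follow 'source' links back to the nodes that were not derived from anything."""
    if not node.source:
        return [node]
    found = []
    for src in node.source:
        found.extend(originals(src))
    return found
```

Calling originals on an aGrove node returns the pGrove node or nodes behind it, and the same walk works for nodes derived in more than one step.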
Architecture-specific processors must maintain their own semantic groves, which hold those objects directly related to the processor's semantics. For HyTime engines, the HyTime semantic grove holds the objects defined in the HyTime property set. The objects in the HyTime semantic grove may be derived from many different nodes derived from many documents. An event node, for example, would be derived from extlist-form elements in a finite coordinate space. The extent property would be derived from the various elements making up the event's extent, and so on. Figure 5 shows the construction of a semantic grove from a pGrove and an aGrove.
Figure 5. Construction of a HyTime Semantic Grove
A HyTime engine uses the HyTime semantic grove, along with the other groves in the hypergrove of which the semantic grove is a member, to do whatever processing it needs to do. This processing includes the construction of a new grove for the original document reflecting the effective results of HyTime-specific processing. For example, if the content location facility is used, the HyTime engine must resolve any content locations to determine the effective content of those elements before it can resolve any location addresses. This new grove is the effective pGrove, or epGrove, of the client document. Figure 6 shows an epGrove being produced from the other groves in the hypergrove.
Figure 6. Creation of an Effective pGrove
An actual application would probably not literally create a new in-memory representation of the client document, but would just augment its existing representation. However, it's easier to talk about the abstract processing if it is represented as a separate creation process. To keep the grove abstraction simpler and to make location addressing more tractable, groves are considered to be static once created. There is no notion of changing a grove once it has been created. In particular, the grove position of a node cannot change once it has been set in a grove. In the abstract processing model change is represented as destruction of the old grove followed by creation of a new grove. Actual applications can, of course, have more dynamic real data structures.
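The rule that groves are static, with change modeled as the creation of a new grove, can be illustrated with immutable Python objects. StaticNode and rebuild_with_subnode are hypothetical names chosen for this sketch, and the model covers only labels and subnodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticNode:
    """Hypothetical immutable node: its fields cannot be reassigned after creation."""
    label: str
    subnodes: tuple = ()


def rebuild_with_subnode(root, new_subnode):
    """Model a change as a new grove; the old grove is left untouched."""
    return StaticNode(root.label, root.subnodes + (new_subnode,))
```

Positions in the old grove stay valid for as long as that grove exists, which is what keeps location addressing tractable in the abstract model.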
Groves are an abstraction designed to enable inter-standard and inter-application interaction. Because they represent real data and because the nature of that data is well defined through property sets, it is possible to define a canonical representation of groves. The Extended Facilities annex defines this canonical representation using an SGML document type and a set of severe constraints on how the source documents are organized such that a given grove can produce one and only one string representation of the grove. This string representation can then be used to do string comparisons of groves. This can aid in checking processors for conformance and in debugging. The canonical grove representation can also be used to interchange groves among processors, if necessary. Canonical grove representations, when compressed using normal compression techniques, could also serve as a binary form of SGML documents and application-specific data structures (semantic groves), which, once decompressed, would be usable by any grove-based applications.
A typical canonical grove looks like this:
<!DOCTYPE GROVE PUBLIC "ISO/IEC 10744:1992//DTD Canonical Grove Representation//EN">
<GROVE>
<NODE CLASS="sgmldoc">
<NODEPROP ID="X1" DATATYPE="nnl" RCSNM="prop1" NODEREL="SUBNODE">
property value
</NODEPROP>
<NODEPROP ID="X2" DATATYPE="nnl" RCSNM="prop2" NODEREL="SUBNODE">
property value
</NODEPROP>
<NODE ID="X3" CLASS="foo">
<NODEPROP ID="X4" DATATYPE="nnl" RCSNM="prop2" NODEREL="UREFNODE">
property value
</NODEPROP>
<NODEPROP ID="X5" DATATYPE="nl" RCSNM="prop2" NODEREL="IREFNODE">
property value
</NODEPROP>
</NODE>
<NODE ID="X6" CLASS="bar">
<NODEPROP ID="X7" DATATYPE="string" RCSNM="prop2" NODEREL="ATOMIC">
property value
</NODEPROP>
<NODEPROP ID="X8" DATATYPE="nnl" RCSNM="prop2" NODEREL="SUBNODE">
property value
</NODEPROP>
</NODE>
</NODE>
</GROVE>
The basic rules for canonical grove documents are that each start and end tag is on a line by itself and attribute values are always enclosed in literals. Subnode relationships are represented by direct containment. Iref and uref relationships are represented by ID references. Every element is assigned an ID using a fixed algorithm of numbering nodes sequentially in a depth-first, left-list traversal of the subnode graph of the grove. Uref relationships cannot be represented by working cross-document addresses because there is no way to consistently declare the other groves as documents within the grove document. Thus, urefs are simply numbered sequentially in the order they are encountered in the grove.
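The numbering and layout rules above can be approximated in a few lines of Python. This is a deliberately reduced sketch: it numbers only NODE elements (the real representation also numbers NODEPROP elements), it assumes a dictionary-based grove with "class" and "subnodes" keys, and the function names canonical_lines and canonical_string are hypothetical.

```python
from itertools import count


def canonical_lines(node, ids=None):
    """Emit one line per start or end tag, numbering nodes depth-first."""
    ids = count(1) if ids is None else ids
    # Attribute values are always quoted; the ID comes from the shared counter.
    lines = [f'<NODE ID="X{next(ids)}" CLASS="{node["class"]}">']
    for child in node.get("subnodes", []):
        # Subnode relationships are represented by direct containment.
        lines.extend(canonical_lines(child, ids))
    lines.append("</NODE>")
    return lines


def canonical_string(root):
    """Join the lines so that two groves can be compared as plain strings."""
    return "\n".join(canonical_lines(root))
```

Because the numbering is fixed by the traversal order, two structurally identical groves yield the same string, which is the property that makes string comparison usable for conformance checking.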
This HTML document was created from the original SGML using Panorama style sheets. Subsequent modifications were done with HoTMetaL PRO 3.0.