SGML: Groves (Kimber)



Subject: Re: Groves?

Date: Fri, 30 Aug 1996 11:59:39 -0500

From: "W. Eliot Kimber" <kimber@passage.com>

Newsgroup: comp.text.sgml



--------------------------------------------------------------------

> Steve Tinney <s@eryops.walnut.lane> wrote in article
<m2zq3d90g5.fsf@eryops.walnut.lane>...
> 
> [len bullard wrote]
>   The work on groves is significant.
> 
> I don't understand this or other references to `groves' in your
> interesting article.  Since you invited questions to CTS, could
> you or someone else explain, or give a useful URL?

Here is my attempt to explain groves with a minimum of detail. As Steve
Newcomb mentioned in his related post, the HyTime conference materials
will be available from the GCA web site (www.gca.org). This will include
my presentation on groves, which is excerpted from the current draft of my
forthcoming HyTime book.  

A number of us are still struggling to find the most effective way to
explain groves to different audiences. Therefore, if you find the
following explanation unhelpful, let me know and I'll try again.  Groves
seem to make intuitive sense to programmers but not to non-programmers.
The following explanation is intended primarily for non-programmers. 

At the conference, Murry Maloney asked question "who needs to know about
groves?"  My answer is:

A. If you are just an author of SGML documents, you probably don't
   need to know about groves at all, even you will be using 
   basic HyTime linking and addressing in your documents.

B. If you design SGML document types and document architectures,
   you don't necessarily need to know anything about groves
   and property sets, but you might find it useful to be able
   to define the semantic objects your document types or
   architectures represents in grove terms.

C. If you are an author who will be doing out-of-the ordinary
   things with location addressing in HyTime (or another
   grove-based addressing scheme), then you need to understand
   the basics of groves because all location addresses
   ultimately address nodes in groves. In particular, you need
   to understand how to predict a node's address given the
   rules for constructing the grove you are addressing.  This
   is especially true if you will be using queries for addressing.

   In addition, there are features of HyTime that appear to 
   be "magic" unless you understand their underlying grove-based
   definition.  If you aren't willing to take it on faith that
   this magic is not arbitrary, you need to understand groves in
   order to understand the definition of the magic. This is
   analogous to having to learn number theory to understand why
   it's really the case that 1+1=2.

D. If you are creating DSSSL style sheets "by hand" (as opposed
   to through some interface that hides the details), you need
   to understand groves and property sets because DSSSL processing
   operates on groves.  This is analogous to having to understand
   for DynaText how it represents documents internally so that you 
   can build complex property value functions.

E. If you are trying to understand the functional relationship
   between HyTime and DSSSL, you need to understand groves because
   it is through the grove mechanism that these two standards
   have been rationalized.

F. If you are building SGML tools that you expect to interoperate
   with other tools (editors, browsers, document managers,
   etc.), you need to understand groves because groves can act
   as a common data API by which different tools communicate
   about the documents they operate on.

G. If you are designing query languages or specialized addressing
   notations for use with HyTime or DSSSL, you need to understand
   groves so that you can define your language in grove terms.

In brief: if it is enough for you to be told that "groves make it all
work" then you don't need to know more. If you want to understand the
formalism that does make it work, read on.

NOTE: In the discussion that follows, I refer to the "corrected HyTime
standard". This is the HyTime standard as corrected by the forthcoming
Technical Corrigendum (TC). In ISO procedures, a TC is a correction,
rather than a revision, being essentially a list of errata. Thus the
HyTime standard is being corrected rather than revised. 

                A Brief Introduction to Groves

================
What Is A Grove?

To oversimplify, a "grove" is the in-memory result of parsing a document.
Groves consist of "nodes" and each node is simply a set of properties. For
example, an "element" node has properties like "gi", "children", "markup",
and "attribute list".  The values of some properties may serve to connect
nodes in the grove together into a graph. Each node has a defined class.
Each grove has a single "grove root" node.  Groves may be related together
to form a "hypergrove". 

Groves provide an abstraction that lets us talk about the processing SGML
tools do without worrying about implementation details. Groves and their
construction let us describe general processing models from which actual
implementations can be derived without having to worry that the
implementations have accidently overlooked important details during
optimization.  In particular, it gives system developers a formalism for
saying "this is the information we care about and this is the information
we don't care about" because groves define the total set of possible
information an SGML application might work on. 

Groves are called groves because they are collections of trees. 

We tend to think of SGML documents as being trees, but in fact, there are
a number of different trees within SGML documents. For example, there is
the tree rooted at the document element (what we usually think of as the
"document tree"). The doctype type declaration can be thought of as a
tree. There are other parts of documents that are not trees, but are
simply objects of different types related to their origin objects. For
example, an SGML document consists of a prolog and an instance, but we
don't normally think of the prolog as being a child of the SGMLDOC object.
Rather, the prolog is simply a property of the SGML document.

In other words, it is useful to be able to say that some parts of the
grove form trees and that others don't. In particular, it enables the use
of tree-based addressing without having to worry about whether or not the
inclusion of optional objects or properties will change the tree position
of objects. For the tree rooted at the document element, we usually think
of the tree as consisting only of elements and their data content but not
things like processing instructions or comments. The grove mechanism lets
us say explicitly that PIs and comments are not part of the tree even if
they are included in the grove. 

When you "construct" a grove, you can choose to include or omit certain
objects and properties from the grove. If your processing never needs to
look at comments, for example, you can simply omit comments from your
grove entirely, saving processing time and memory space. Because you can
also say that PIs are not part of the instance tree, you know that
excluding PIs will not change the tree locations of elements in the tree.

Thus, you can think of a grove as a "parse tree," but it's a bit more than
that because parsing an SGML document doesn't just produce a tree, it
produces a lot of other things that we don't normally think of as being in
a tree. [That said, there is a view of a grove in which the grove *is* a
tree in the formal sense of being an acyclic directed graph where all the
arcs relate children to their parents. This is called the "subnode tree".
The definition of subnode trees is beyond the scope of this discussion.]

Thus, a grove is the in-memory data structure that reflects the objects
recognized in source data by a parser. It does what ESIS tried to do:
define what the result of parsing is; but does it completely and in a way
that is tailorable to the needs of particular processors.  You can think
of groves as a sort of data API for SGML processing systems where the
grove becomes an agreed upon representation of the data all SGML
processors need to operate on.

If you imagine that a grove exists as a data structure that multiple
applications can read at the same time, you can see that groves provide a
common data abstraction that allows different applications to talk about
the data in SGML documents the same way. Thus, if an editor wants to tell
a browser to view a particular element, the editor can talk to the browser
in grove terms, knowing that the browser will always understand it,
regardless of how the browser works internally.

The key things to remember are:

o A grove is an in-memory representation of the result of 
  parsing. This is an abstract representation in the sense
  that it is not intended that real systems actually use the
  grove structure literally for their own internal memory.

o The grove acts as a *lingua franca* by which different
  applications can communicate with each other about documents.

o Groves can be used to model the processing of SGML documents
  independent of implementation issues.  Different
  implementations can then be derived that have whatever
  optimizations they need but that reflect the general model
  and for which the mapping from internal structures to 
  grove structures is explicit.


========================
Groves and Property Sets

Each node in a grove has a defined class and serves as a container of
properties. This begs the question of where these classes and properties
are defined.  In order to actually work with groves you have to know what
classes and properties to refer to. For example, to create a query that
will find elements with a particular attribute value, you have to know
first how to refer to the list of all the elements in a document and then
how to refer to the attributes of elements.  While it may seem intuitive
to say "elements have attributes" and leave it at that, it's not that
simple in practice.

The classes and properties that an SGML document represents are not, for
the most part, explicit in the SGML standard. Rather, the standard defines
the rules for recognizing delimiters and data in the SGML document string
and what those delimiters represent.  The SGML standard lacked a formal
definition of what the abstract objects and properties were that the
syntax of an SGML document represents.  Some aspects of SGML documents,
like elements and their attributes, were pretty obvious, but many other
aspects were either not obvious or could be interpreted in different ways.
In particular, the SGML standard says nothing about what things in an SGML
document instance are to be considered part of the "document tree".  There
are also places where things could be related together in a number of
different ways. Thus, arbitrary choices needed to be made about how to
relate some things together.

Before groves, both DSSSL and HyTime had a notion of there being an "SGML
property set" that defined formally what the objects and properties were.
It became clear that there needed to be a single, common property set for
SGML documents from which DSSSL and HyTime (and any other SGML processor)
could work.  This property set was defined by the SGML, DSSSL, and HyTime
teams. [The property set is based on much deep thought done by Lynn Price,
Dave Peterson, and other long-time members of the SGML team.  James Clark
was instrumental in working out the details of property sets and groves
and tested the designs in his SP parser, the results of which you can now
see in the SP code James makes available (www.jclark.com). Peter Newcomb
and Derek Denny-Brown of TechnoTeacher have done much to test the grove
design through their work on a new version of the HyMinder product, which
will be fully grove based.]

The SGML property set is published in the DSSSL standard (ISO/IEC
10179:1996) but used by both DSSSL and HyTime. The property set defines
formally all the object classes and their properties. The output of any
SGML parser can now be defined in terms of these objects and properties.
You can think of this as ESIS++: a formal definition of what things a
parser will tell you about an SGML document. Unlike ESIS, the SGML
property set is complete, reflecting everything explicit or implicit in
the SGML standard. 
The SGML property set will become part of the formal specification of SGML
when it is revised.

Property sets are defined using an SGML document conforming to the
property set document type defined in annex C of the corrected HyTime
standard.  

A property set document consists of a set of class definitions. Each class
then has a set of associated property definitions. Property definitions
may refer to new data types and lexical types defined in the property set.
 Property sets are organized into one or more modules. Modules typically
group related classes or properties together to make it easier to include
or ignore groups of classes or properties in a particular grove.

Groves are said to be "constructed". A grove constructor is a program that
takes as input the output of a parser (or other processor), a property
set, and a "grove plan". The grove plan says what classes and properties
from the full property set to include in the constructed grove. For
example, if your processor doesn't care about PIs, you could define a
grove plan that simply excludes the PI object class (and therefore its
properties as well) from the grove.

How a grove plan is specified may be different for different processors.
For example, in HyTime, grove plans are specified using an element type
form defined in the HyTime standard but other processors could use a
different mechanism.  For every property set there is an implicit "maximum
grove plan" that includes all classes and properties.

=============================================
Application-Specific Property Sets and Groves

Groves and property sets are not limited to the task of parsing SGML
documents and their subsequent processing. Applications can define
property sets that reflect their own unique needs and then create groves
reflecting that data.

For example, a HyTime engine processes SGML documents and then creates new
objects unique to HyTime, such as hyperdocuments, hyperlinks, location
addresses, and so on. These objects reflect the *semantics* of HyTime,
rather than the syntax of SGML documents. Thus, the HyTime standard
defines a unique HyTime property set that reflects the semantic objects a
HyTime engine works with. This property set is distinct from the SGML
property set.

The HyTime property set is then used by a HyTime engine to create "HyTime
semantic groves" from the documents in a HyTime hyperdocument. You can
imagine that a HyTime engine takes as input one or more SGML document
groves (produced by an SGML parser/grove constructor) and then produces
its own HyTime semantic grove.

This process is analogous to what any hyperlink management system does
today when it processes a bunch of documents and creates its own data
structures for representing the links among the documents. For example,
many such systems create relational database tables that record the links
between documents. The HyTime semantic grove is simply a formalized,
abstract representation of this, separated from implementation details.  

Note that given the HyTime property set, a HyTime semantic grove, and the
grove plan by which it was constructed, any HyTime-aware processor would
be able to use that grove to do HyTime-specific processing. In particular,
you would be able to query it using a grove-aware query language (such as
SDQL from the DSSSL standard) to ask questions about links and location
addresses *because the objects and properties are standardized*. You would
not need to know how a particular HyTime engine had chosen to structure
its internal data. 

Note also that any HyTime engine can provide a "grove view" of its
internal data structures if it wants to. This means that the grove can
again act as common data access API among different HyTime applications,
independent of how they are implemented and optimized under the covers. 

Any application or architecture can define its own semantic property set
and therefore define its processing in terms of groves constructed
according to that property set. Grove construction is defined in a "grove
construction document". The grove construction document is intended
primarily for consumption by human programmers. For each class and
property in the property set, it defines the rules by which they are
constructed. These could be simple prose instructions for a programmer to
read and follow or they could be more programmatic.  In the abstract, the
construction of semantic groves could be defined by specifying the SDQL
queries to be applied to the input groves where the results of the queries
would be objects in the new grove (SDQL queries could include functions
that perform arbitrary processing).

Thus, for a complex document type or architecture, property sets and grove
construction documents can be useful simply as a formalism for design
specification, even if there is no intention to actually process these
specifications programmatically.

===========================
What More is There To Know?

I have not talked about the details of how classes and properties are
defined. I have not talked about how the nodes in groves are related to
each other or how the values of properties are defined and used. These are
important details that only implementors and people doing sophisticated
queries need to understand.  I have not talked about views on groves,
derived and auxiliary groves, hypergroves, and the like.

I do explain these details in my forthcoming book.

-- 
<Address HyTime=bibloc 
         homepage="http://www.squirrel.com/squirrel/drmacro">
W. Eliot Kimber, kimber@passage.com 
Senior SGML Consultant and HyTime Specialist
Passage Systems, Inc., 10596 N. Tantau Ave., Cupertino, CA 95014-3535 
(408) 366-0300 (Cupertino), (512) 339-1400 (Austin), 
http://www.passage.com </Address>
"If I never had existed, would you still remember me?..." 
--Austin Lounge Lizards, "1984 Blues" 
(http://www.webcom.com/~yeolde/all/lllhome.html)