Groves: an illustrated example

Two informative CTS postings by Eliot Kimber
Subject: Re: Groves:  an illustrated example

Date: 25 Mar 1997 15:42:45 GMT

From: "W. Eliot Kimber" <eliot@isogen.com>

Newsgroup: comp.text.sgml


          --------------------------------------------------

Chris McCauley <cmccaul@cris.com> wrote in article
<33373f8c.2280486@news.cris.com>...

> Next, what is DSSSL and what is the relation to SGML?

DSSSL is International Standard ISO/IEC 10179:1996, Document Style
Semantics and Specification Language.  It is a companion to the SGML
standard.  It provides a specification language (derived from Scheme) for
creating both presentation style specifications ("style sheets") and
tree-to-tree transforms.  It is designed specifically to operate on SGML
documents, but is not necessarily limited to that application.
 
> Finally, dare I ask about groves?

To oversimplify, DSSSL specifications operate on trees of nodes to create
new trees (which may either represent a formatted result, the "flow object
tree" or a new SGML document {when doing transforms}).  The nodes are
organized into a specialized data structure we chose to call "groves"
(because they are "independent trees and other stuff", which is more or
less what a real-world grove is).

Technically, a grove is a directed graph of nodes (possibly cyclic when
referential relationships are taken into account).  Each node exhibits one
or more "properties", which may be "atomic" values (strings, booleans,
integers, etc.) or lists of nodes or references to nodes. (Thus a node is
nothing more than a set of properties.) 
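
[Editor's illustration, not part of the original posting.]  The definition
above can be sketched in a few lines of Python.  This is only a minimal
sketch: the Node class and the property names used here ("gi", "content",
"referent") are chosen for illustration and are not taken from any property
set definition.  It shows one atomic property, one node-list property, and
one property that merely refers to a node, which is where a cycle can arise.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(eq=False)  # eq=False: nodes compare by identity, not by value
class Node:
    """A node is nothing more than a class name plus a set of properties."""
    node_class: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Atomic property: a plain string value.
title = Node("element", {"gi": "title"})

# Nodal property: a list of nodes.
section = Node("element", {"gi": "section", "content": [title]})

# Referential property: points at an existing node without containing it.
link = Node("element", {"gi": "xref", "referent": section})

# section -> link (via content) and link -> section (via referent):
# the graph is cyclic once the reference is taken into account.
section.properties["content"].append(link)
```

Turning off value equality is a deliberate choice in this sketch: two nodes
with the same properties remain two distinct nodes.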

Informally, a grove may be thought of as a "parse tree."  In short, a grove
is an ordinary "graph of nodes with properties" data structure specialized
to meet the specific needs of the DSSSL and HyTime standards.  Groves
include special relationship types and some rules about how they can be
constructed.

Groves are said to be "constructed" by a "grove construction process",
which takes as input the result of parsing some data (e.g., an SGML
document) and produces as output a grove of nodes (i.e., an abstract
representation of the data structures and relationships described by the
parsed data).  In constructing an SGML grove, an SGML parser recognizes
syntactic constructs and communicates them to the grove constructor, which
then creates the appropriate node and/or properties in the grove.  For
example, when the parser sees a start tag, the grove constructor will
create an "element" node.  When the parser sees an attribute on the start
tag, the grove constructor will create an "attributes" property for the
element node, an attribute node within the attributes property (whose value
is a node list), and so on.

You can create your own grove constructor using a tool like NSGMLS (part of
the SP package from James Clark).  The output of NSGMLS is essentially a set
of signals from the parser for each syntactic thing it saw.  You could then
create a program (say, in Perl) that builds a literal grove in memory
that you could then operate on.  In other words, there's no magic in this
grove thing: it's just normal computer science and data processing,
abstracted a bit in its specification, specialized for SGML, and made
implementation independent.
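
[Editor's illustration, not part of the original posting.]  A toy grove
constructor along these lines might look as follows, written in Python
rather than Perl.  It assumes the line-oriented NSGMLS output in which "A"
lines carry attributes for the next start tag, "(" and ")" mark element
start and end, and "-" carries data.  The dictionary layout, the build_grove
name, and the sample input are all made up for this sketch; escape sequences
in data and every other output code are ignored.

```python
def build_grove(esis_lines):
    """Build a nested-dict 'grove' from NSGMLS-style output lines."""
    root = {"class": "document", "content": []}
    stack = [root]   # currently open nodes, innermost last
    pending = []     # attribute lines precede their start-tag line
    for line in esis_lines:
        code, rest = line[:1], line[1:]
        if code == "A":
            parts = rest.split(" ", 2)  # name, declared type, optional value
            pending.append({
                "class": "attribute",
                "name": parts[0],
                "type": parts[1] if len(parts) > 1 else None,
                "value": parts[2] if len(parts) > 2 else None,
            })
        elif code == "(":
            elem = {"class": "element", "gi": rest,
                    "attributes": pending, "content": []}
            pending = []
            stack[-1]["content"].append(elem)
            stack.append(elem)
        elif code == ")":
            stack.pop()
        elif code == "-":
            stack[-1]["content"].append({"class": "data", "text": rest})
        # Any other line is skipped in this sketch.
    return root


# Hand-written sample input in the assumed format.
sample = [
    "AID CDATA intro",
    "(SECTION",
    "(TITLE",
    "-First section",
    ")TITLE",
    ")SECTION",
    "C",
]
grove = build_grove(sample)
```

After this runs, grove["content"][0] is the SECTION element node, with one
attribute node in its "attributes" list and the TITLE element in its
"content" list.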

The node classes and properties for grove nodes are defined in a "property
set definition document" (psdd), which is nothing more than a formal
specification of the classes and their properties.  The syntax and
semantics of property set definition documents are defined in the
soon-to-be-released corrected HyTime standard (ISO/IEC 10744:1992), but
they're pretty obvious once you learn what all the attribute names mean.

The DSSSL standard contains the property set definition document for the
SGML standard, which will also be in the corrected HyTime standard (and
eventually in the revised SGML standard).  In addition, the HyTime standard
defines property sets for HyTime's semantic objects (the nodes a HyTime
engine operates on after it has applied HyTime semantics to SGML documents
and other data).  

Property sets can be defined for any data notation and thus
application-independent groves can be constructed according to those
property sets.  For example, given a property set for something like RTF it
would be possible to build an "RTF grove" from Word documents.  Once you
had such a grove, you could apply grove-based processing to it, including
DSSSL style specs and transforms, as well as HyTime location addressing and
hyperlinking.  However, there is no guarantee that the property set and
grove mechanism is capable of expressing all aspects of any notation other
than SGML.

Property sets and groves provide an implementation-independent data
abstraction that disparate tools and specifications can communicate in
terms of without the need for any tool to actually use that abstraction
under the covers.  For example, most SGML editors create their own unique
tree view of SGML documents, which is what the editor then operates on. 
Today, each editor's tree view is different because the developers made
different, sometimes arbitrary, choices about how to represent SGML
documents in memory.  The SGML data model is complex enough that there are
many possible, equally-useful, ways to represent the result of parsing. 
However, because these tools all operate on SGML, their different views are
similar enough that they could all provide an API that behaved as if they
were using the SGML property set (and thus the tree structures that result
from it) under the covers.  This would have the advantage of providing a
common access API for SGML documents that would be independent of the
implementation choices each tool made.  SGML document management systems
could do the same thing.
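
[Editor's illustration, not part of the original posting.]  The idea of a
common access API over different internal representations can be sketched
like this.  Everything here is hypothetical: the two "editors", their storage
layouts, and the two-method ElementView interface are invented for the
example and stand in for a far larger interface derived from the SGML
property set.

```python
from typing import List, Protocol, Sequence


class ElementView(Protocol):
    """The common view every tool agrees to expose (hypothetical)."""
    def gi(self) -> str: ...
    def children(self) -> Sequence["ElementView"]: ...


class TupleBackedElement:
    """Hypothetical editor A stores elements as (gi, [children]) tuples."""
    def __init__(self, raw):
        self._raw = raw

    def gi(self) -> str:
        return self._raw[0]

    def children(self):
        return [TupleBackedElement(c) for c in self._raw[1]]


class DictBackedElement:
    """Hypothetical editor B stores elements as dictionaries."""
    def __init__(self, raw):
        self._raw = raw

    def gi(self) -> str:
        return self._raw["name"]

    def children(self):
        return [DictBackedElement(c) for c in self._raw["kids"]]


def outline(elem: ElementView, depth: int = 0) -> List[str]:
    """Written once against the common view; works with either editor."""
    lines = ["  " * depth + elem.gi()]
    for child in elem.children():
        lines.extend(outline(child, depth + 1))
    return lines


a = TupleBackedElement(("section", [("title", []), ("p", [])]))
b = DictBackedElement({"name": "section", "kids": [
    {"name": "title", "kids": []},
    {"name": "p", "kids": []},
]})
```

Neither tool changes how it stores documents; outline(a) and outline(b)
produce the same result because both behave as if they shared one model.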

[Hint: there's no difference between an SGML document manager that manages
documents at the element level and an SGML editor except how the user
interface looks.  Any editor can be made into a document manager and any
such document manager can be made into an editor.]

The HyTime standard also relies on groves and property sets in that all
HyTime functionality is defined as operations on nodes in groves, rather
than on the syntax of SGML documents.  Thus, HyTime and DSSSL share the
same fundamental abstract data view of SGML documents.  This same abstract
data view could be used by any other processor, standard or proprietary.

You can find out more about DSSSL and the SGML property set at James
Clark's web site, http://www.jclark.com/dsssl

-- 

      ============ Second part of CTS response by Kimber ============

Subject: Re: Groves:  an illustrated example

Date: 26 Mar 1997 15:00:52 GMT

From: "W. Eliot Kimber" <eliot@isogen.com>



Harald Hanche-Olsen <hanche@math.ntnu.no> wrote in article
<pcod8sn8w0e.fsf@ikaros.math.ntnu.no>...
> : "W. Eliot Kimber" <eliot@isogen.com> :
> 
> | Technically, a grove is a directed graph of nodes (possibly cyclic
> | when referential relationships are taken into account).  Each node
> | exhibits one or more "properties", which may be "atomic" values
> | (strings, booleans, integers, etc.) or lists of nodes or references
> | to nodes. (Thus a node is nothing more than a set of properties.)
> 
> Hmm.  I noticed this definition in the standard.  To a mathematician
> like me, this implies that two nodes cannot exhibit the same
> properties, since then they would indeed be the same node.  This
> raises the question what the grove corresponding to the following
> document would be like:
> 
> <!-- document type definition omitted.  supply your own. -->
> <barf>
>   <section><title>First section</title>
>     <p>A paragraph</p>
>     <p>A paragraph</p>
>   </section>
>   <section><title>Second section</title>
>     <p>A paragraph</p>
>   </section>
> </barf>
> 
> The first section might contain two references to the only(!) <p> node
> in the document, but the second section creates a problem, since it is
> not clear who the parent of that <p> node would be.  No node can have
> more than one parent, can it?

No.  Every node (except the grove root) has exactly one origin node.  A
node is a _subnode_ of another
node when the other node is its origin.  [This is a "parent child"
relationship, but the grove design reserves the terms "parent" and "child"
for those origin-to-subnode relationships that represent the "content" of
nodes, as distinct from other properties that may contain subnodes but are
not considered content by the semantics of the data notation from which the
grove is constructed.  In SGML, this makes it possible to clearly
distinguish what we normally think of as the "content" of elements (data
characters and subelements) from all the other properties that elements
have, like attributes, markup, element types, and so on.

Each node class can designate a single "content" property.  The content
property may be either atomic, meaning that its value is a string, integer,
boolean, etc. (or list thereof), or nodal, meaning that its value is a node
list.  When the value is nodal, the property is also said to be the
"children" property of the node.  Thus, in addition to being the origin of
the nodes in the content property, the node is also the parent of the nodes
in its content property and those nodes are its children.

Given this distinction between subnodes and the subset of subnodes that are
also children, it should be clear that one can get two views of a grove:
the "subnode tree", which includes all nodes in a grove and is defined by
the origin-to-subnode relationships, and the "content tree" view, which
includes all nodes that are the root of a content tree and the nodes in the
content properties of those nodes.  Note that there may be many
discontiguous content trees in a grove.

For example, in the SGML property set, each element node has several
subnode properties, including the attributes property and the content
property.  The attributes property value is a node list of attribute nodes.
The content property is a node list of those nodes that can occur inside
elements: datachars, elements, PIs, comments, etc.  Both the attribute
nodes in the attributes property and the nodes in the content property have
the element as their origin, but only the nodes in the content property
have the element as their parent.  When you take the subnode tree view of
the element, you will see all the nodes in all the subnode properties. 
When you take the content tree view, you will only see the nodes in the
content property.

This distinction makes it easier to write processors, addresses, and
queries that restrict themselves to the *semantic content* of the data,
which is something most SGML processing needs to do.  For example, in
HyTime, tree addressing is normally applied to the content tree rooted at
the document element--you don't want to have to account for all the
attributes and other "non-content" properties of elements when calculating
tree addresses.  By having this fundamental distinction between content
trees and other stuff, you don't have to build the distinction into the
semantics of every address, query and processor.
By the same token, when you want to use tree addressing to get at
attributes, you can do so by asking for the subnode tree instead of the
content tree.

Finally, the property set can define a particular node as being the
"principal tree root", meaning that it is the root of the content tree that
should be used by default if a specific tree root is not specified.  In
SGML, this is, of course, the document element, which is the root of the
element content tree.]
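
[Editor's illustration, not part of the original posting.]  The two views
described above, the subnode tree and the content tree, can be sketched as
two traversals over the same nodes.  The Node class, the SUBNODE_PROPS and
CONTENT_PROP tables, and the "data" node class with a single text string are
simplifications invented for this sketch; a real property set defines many
more classes and properties.

```python
# Which properties hold subnodes, and which one of them is the content
# property, for each node class (illustrative tables only).
SUBNODE_PROPS = {"element": ("attributes", "content")}
CONTENT_PROP = {"element": "content"}


class Node:
    def __init__(self, node_class, **props):
        self.node_class = node_class
        self.origin = None
        self.props = props
        # Every node placed in a subnode property gets this node as origin.
        for name in SUBNODE_PROPS.get(node_class, ()):
            for sub in props.get(name, ()):
                sub.origin = self


def subnode_tree(node):
    """Yield the node and everything reachable through subnode properties."""
    yield node
    for name in SUBNODE_PROPS.get(node.node_class, ()):
        for sub in node.props.get(name, ()):
            yield from subnode_tree(sub)


def content_tree(node):
    """Yield the node and everything reachable through content properties."""
    yield node
    name = CONTENT_PROP.get(node.node_class)
    if name is not None:
        for child in node.props.get(name, ()):
            yield from content_tree(child)


def parent(node):
    """The origin, but only when the node sits in its content property."""
    origin = node.origin
    if origin is None:
        return None
    name = CONTENT_PROP.get(origin.node_class)
    if name and any(c is node for c in origin.props.get(name, ())):
        return origin
    return None


attr = Node("attribute", name="id", value="s1")
text = Node("data", text="A paragraph")
p = Node("element", gi="p", attributes=[], content=[text])
section = Node("element", gi="section", attributes=[attr], content=[p])
```

Here the attribute node and the p element both have the section as their
origin, but only p has it as a parent, so the attribute shows up in
subnode_tree(section) and not in content_tree(section).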

> I have come to assume that the axiom of extensionality is not assumed
> to hold, so that distinct sets may have exactly the same elements.  Or
> did I miss something?

Yes.  The nodes in a grove are ordered, such that every node is
distinguished by being enumerated within the node list property that
contains it. Thus the three apparently identical paragraphs above are
actually distinguished by their position within the document. 

In addition, the P element in the second section is distinguished from the
first two by having a different origin property (its origin is the Section
element within which it occurs). 
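
[Editor's illustration, not part of the original posting.]  The answer can be
checked against the sample document with a small sketch.  The Element class,
the make_p helper, and the position function are invented for this example,
and data is simplified to plain strings.  The three paragraphs have equal
content, yet each is a separate node, told apart by its place in a node list
and by its origin.

```python
class Element:
    def __init__(self, gi, content=()):
        self.gi = gi
        self.origin = None
        self.content = list(content)  # ordered node list
        for child in self.content:
            if isinstance(child, Element):
                child.origin = self


def make_p():
    """Create a fresh paragraph node with the same text every time."""
    return Element("p", ["A paragraph"])


def position(node):
    """Index of a node in its origin's content list, found by identity."""
    return next(i for i, c in enumerate(node.origin.content) if c is node)


s1 = Element("section",
             [Element("title", ["First section"]), make_p(), make_p()])
s2 = Element("section",
             [Element("title", ["Second section"]), make_p()])
barf = Element("barf", [s1, s2])

p1, p2 = s1.content[1], s1.content[2]
p3 = s2.content[1]
```

p1 and p2 share an origin but sit at positions 1 and 2 of its content list;
p3 sits at position 1 as well, but its origin is the second section.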

-- 
<Address HyTime=bibloc homepage="http://www.drmacro.com">
W. Eliot Kimber, eliot@isogen.com
Senior SGML Consulting Engineer, Highland Consulting
2200 North Lamar Street, Suite 230, Dallas, Texas 75202
+1-214-953-0004 +1-214-953-3152 (fax) 
http://www.isogen.com (work)</Address>
"Rats in the morning, rats in the afternoon...if they don't go away, 
I'll be reducated soon..."   --Austin Lounge Lizards, "1984 Blues" 
(http://www.webcom.com/~yeolde/all/lllhome.html)