SGML: Groves: an illustrated example

Groves: an illustrated example

Two informative CTS postings by Eliot Kimber


Subject: Re: Groves:  an illustrated example
Date: 25 Mar 1997 15:42:45 GMT
From: "W. Eliot Kimber" <eliot@isogen.com>
Newsgroup: comp.text.sgml
--------------------------------------------------

Chris McCauley <cmccaul@cris.com> wrote in article <33373f8c.2280486@news.cris.com>...

> Next, what is DSSSL and what is the relation to SGML?

DSSSL is International Standard ISO/IEC 10179:1996, Document Style Semantics and Specification Language. It is a companion to the SGML standard. It provides a specification language (derived from Scheme) for creating both presentation style specifications ("style sheets") and tree-to-tree transforms. It is designed specifically to operate on SGML documents, but is not necessarily limited to that application.

> Finally, dare I ask about groves?

To oversimplify, DSSSL specifications operate on trees of nodes to create new trees (which may represent either a formatted result, the "flow object tree", or a new SGML document {when doing transforms}). The nodes are organized into a specialized data structure we chose to call "groves" (because they are "independent trees and other stuff", which is more or less what a real-world grove is).

Technically, a grove is a directed graph of nodes (possibly cyclic when referential relationships are taken into account). Each node exhibits one or more "properties", which may be "atomic" values (strings, booleans, integers, etc.) or lists of nodes or references to nodes. (Thus a node is nothing more than a set of properties.) Informally, a grove may be thought of as a "parse tree."

In short, a grove is an ordinary "graph of nodes with properties" data structure specialized to meet the specific needs of the DSSSL and HyTime standards. Groves include special relationship types and some rules about how they can be constructed.

Groves are said to be "constructed" by a "grove construction process", which takes as input the result of parsing some data (e.g., an SGML document) and produces as output a grove of nodes (i.e., an abstract representation of the data structures and relationships described by the parsed data).

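
As an illustration (not from the original posting), the "graph of nodes with properties" idea can be sketched in a few lines of Python. The class name `Node`, the helper `property_kind`, and the property names `referent` and `referenced_by` are hypothetical choices made for this sketch; they are not taken from the SGML property set.

```python
class Node:
    """A grove node: nothing more than a named class plus a set of properties."""

    def __init__(self, node_class, **props):
        self.node_class = node_class
        self.props = dict(props)


def property_kind(value):
    """Classify a property value as atomic, a node list, or a node reference."""
    if isinstance(value, Node):
        return "node reference"
    if isinstance(value, list) and all(isinstance(v, Node) for v in value):
        return "node list"
    return "atomic"


# Hypothetical two-element document: a figure and a cross-reference to it.
fig = Node("element", gi="FIG", content=[])
xref = Node("element", gi="XREF", content=[], referent=fig)
doc = Node("element", gi="DOC", content=[fig, xref])

# A back-reference makes the graph cyclic once referential
# relationships are taken into account.
fig.props["referenced_by"] = xref

for name, value in xref.props.items():
    print(name, "->", property_kind(value))
```

The tree part of the structure lives in the node-list properties; the reference properties are what can turn the tree into a general directed graph.
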
In constructing an SGML grove, an SGML parser recognizes syntactic constructs and communicates them to the grove constructor, which then creates the appropriate nodes and/or properties in the grove. For example, when the parser sees a start tag, the grove constructor will create an "element" node. When the parser sees an attribute on the start tag, the grove constructor will create an "attributes" property for the element node, an attribute node within the attributes property (whose value is a node list), and so on.

You can create your own grove constructor using a tool like NSGMLS (part of the SP package from James Clark). The output of NSGMLS is essentially a set of signals from the parser for each syntactic thing it saw. You could then create a program (say, in Perl) that builds a literal grove in memory that you could then operate on. In other words, there's no magic in this grove thing: it's just normal computer science and data processing, abstracted a bit in its specification, specialized for SGML, and made implementation independent.

The node classes and properties for grove nodes are defined in a "property set definition document" (psdd), which is nothing more than a formal specification of the classes and their properties. The syntax and semantics of property set definition documents are defined in the soon-to-be-released corrected HyTime standard (ISO/IEC 10744:1992), but they're pretty obvious once you learn what all the attribute names mean.

The DSSSL standard contains the property set definition document for the SGML standard, which will also be in the corrected HyTime standard (and eventually in the revised SGML standard). In addition, the HyTime standard defines property sets for HyTime's semantic objects (the nodes a HyTime engine operates on after it has applied HyTime semantics to SGML documents and other data).

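
As an illustration (not from the original posting), the do-it-yourself grove constructor described above might be sketched in Python rather than Perl. The sketch assumes line-oriented parser output in the general shape NSGMLS emits: `(` for a start tag, `)` for an end tag, `A` for an attribute reported before its start tag, and `-` for data. Escape sequences and all other record types are ignored. The names `Node` and `build_grove` and the `sgml-document` root node are hypothetical, and data is stored as one node per run rather than one node per character.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by value
class Node:
    cls: str                                   # node class, e.g. "element"
    props: dict = field(default_factory=dict)  # property name -> value
    origin: "Node | None" = None               # the node this one is a subnode of


def build_grove(esis_lines):
    """Turn a simplified stream of parser signals into an in-memory grove."""
    root = Node("sgml-document", {"content": []})
    stack = [root]   # currently open nodes, innermost last
    pending = []     # attributes seen since the last start tag
    for line in esis_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # "Aname TYPE value"; IMPLIED attributes carry no value
            name, _type, *value = rest.split(" ", 2)
            pending.append((name, value[0] if value else None))
        elif code == "(":
            owner = stack[-1]
            elem = Node("element",
                        {"gi": rest, "attributes": [], "content": []},
                        origin=owner)
            for name, value in pending:
                elem.props["attributes"].append(
                    Node("attribute", {"name": name, "value": value}, origin=elem))
            pending = []
            owner.props["content"].append(elem)
            stack.append(elem)
        elif code == ")":
            stack.pop()
        elif code == "-":
            owner = stack[-1]
            owner.props["content"].append(
                Node("data", {"text": rest}, origin=owner))
        # every other record type is ignored in this sketch
    return root


esis = ["AID CDATA s1", "(SECTION", "(P", "-A paragraph", ")P",
        "(P", "-A paragraph", ")P", ")SECTION", "C"]
grove = build_grove(esis)
section = grove.props["content"][0]
print(section.props["gi"], len(section.props["content"]))
```

A real constructor would follow the SGML property set for class and property names; the point here is only that each parser signal maps to creating a node or filling in a property.
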
Property sets can be defined for any data notation and thus application-independent groves can be constructed according to those property sets. For example, given a property set for something like RTF it would be possible to build an "RTF grove" from Word documents. Once you had such a grove, you could apply grove-based processing to it, including DSSSL style specs and transforms, as well as HyTime location addressing and hyperlinking. However, there is no guarantee that the property set and grove mechanism is capable of expressing all aspects of any notation other than SGML.

Property sets and groves provide an implementation-independent data abstraction that disparate tools and specifications can communicate in terms of without the need for any tool to actually use that abstraction under the covers. For example, most SGML editors create their own unique tree view of SGML documents, which is what the editor then operates on. Today, each editor's tree view is different because the developers made different, sometimes arbitrary, choices about how to represent SGML documents in memory. The SGML data model is complex enough that there are many possible, equally-useful, ways to represent the result of parsing.

However, because these tools all operate on SGML, their different views are similar enough that they could all provide an API that behaved as if they were using the SGML property set (and thus the tree structures that result from it) under the covers. This would have the advantage of providing a common access API for SGML documents that would be independent of the implementation choices each tool made. SGML document management systems could do the same thing.

[Hint: there's no difference between an SGML document manager that manages documents at the element level and an SGML editor except how the user interface looks. Any editor can be made into a document manager and any such document manager can be made into an editor.]

The HyTime standard also relies on groves and property sets in that all HyTime functionality is defined as operations on nodes in groves, rather than on the syntax of SGML documents. Thus, HyTime and DSSSL share the same fundamental abstract data view of SGML documents. This same abstract data view could be used by any other processor, standard or proprietary.

You can find out more about DSSSL and the SGML property set at James Clark's web site, http://www.jclark.com/dsssl

--

============ Second part of CTS response by Kimber ============

Subject: Re: Groves: an illustrated example
Date: 26 Mar 1997 15:00:52 GMT
From: "W. Eliot Kimber" <eliot@isogen.com>

Harald Hanche-Olsen <hanche@math.ntnu.no> wrote in article <pcod8sn8w0e.fsf@ikaros.math.ntnu.no>...

> : "W. Eliot Kimber" <eliot@isogen.com> :
>
> | Technically, a grove is a directed graph of nodes (possibly cyclic
> | when referential relationships are taken into account). Each node
> | exhibits one or more "properties", which may be "atomic" values
> | (strings, booleans, integers, etc.) or lists of nodes or references
> | to nodes. (Thus a node is nothing more than a set of properties.)
>
> Hmm. I noticed this definition in the standard. To a mathematician
> like me, this implies that two nodes cannot exhibit the same
> properties, since then they would indeed be the same node. This
> raises the question what the grove corresponding to the following
> document would be like:
>
> <!-- document type definition omitted. supply your own. -->
> <barf>
> <section><title>First section</title>
> <p>A paragraph</p>
> <p>A paragraph</p>
> </section>
> <section><title>Second section</title>
> <p>A paragraph</p>
> </section>
> </barf>
>
> The first section might contain two references to the only(!) <p> node
> in the document, but the second section creates a problem, since it is
> not clear who the parent of that <p> node would be. No node can have
> more than one parent, can it?

No. In groves, every node (except the grove root) has exactly one origin node. A node is a _subnode_ of another node when the other node is its origin.

[This is a "parent-child" relationship, but the grove design reserves the terms "parent" and "child" for those origin-to-subnode relationships that represent the "content" of nodes, as distinct from other properties that may contain subnodes but are not considered content by the semantics of the data notation from which the grove is constructed.

In SGML, this makes it possible to clearly distinguish what we normally think of as the "content" of elements (data characters and subelements) from all the other properties that elements have, like attributes, markup, element types, and so on.

Each node class can designate a single "content" property. The content property may be either atomic, meaning that its value is a string, integer, boolean, etc. (or list thereof), or nodal, meaning that its value is a node list. When the value is nodal, the property is also said to be the "children" property of the node. Thus, in addition to being the origin of the nodes in the content property, the node is also the parent of the nodes in its content property and those nodes are its children.

Given this distinction between subnodes and the subset of subnodes that are also children, it should be clear that one can get two views of a grove: the "subnode tree", which includes all nodes in a grove and is defined by the origin-to-subnode relationships, and the "content tree" view, which includes all nodes that are the root of a content tree and the nodes in the content properties of those nodes. Note that there may be many discontiguous content trees in a grove.

For example, in the SGML property set, each element node has several subnode properties, including the attributes property and the content property. The attributes property value is a node list of attribute nodes. The content property is a node list of those nodes that can occur inside elements: datachars, elements, PIs, comments, etc. Both the attribute nodes in the attributes property and the nodes in the content property have the element as their origin, but only the nodes in the content property have the element as their parent. When you take the subnode tree view of the element, you will see all the nodes in all the subnode properties. When you take the content tree view, you will only see the nodes in the content property.

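
As an illustration (not from the original posting), the two views can be sketched in Python. The tables `SUBNODE_PROPS` and `CONTENT_PROP` stand in for what a property set would declare per node class; they, the `Node` class, and the helper names are hypothetical, and the node classes are reduced to three.

```python
from dataclasses import dataclass, field

# Per node class: which properties hold subnodes, and which single
# property (if any) is the designated content property.
SUBNODE_PROPS = {"element": ["attributes", "content"], "attribute": [], "data": []}
CONTENT_PROP = {"element": "content"}


@dataclass(eq=False)  # nodes are compared by identity, not by value
class Node:
    cls: str
    props: dict = field(default_factory=dict)
    origin: "Node | None" = None


def add(origin, prop, node):
    """Attach node to origin under the named subnode property."""
    node.origin = origin
    origin.props.setdefault(prop, []).append(node)
    return node


def subnodes(node):
    """All nodes whose origin is this node, across every subnode property."""
    for prop in SUBNODE_PROPS.get(node.cls, []):
        yield from node.props.get(prop, [])


def children(node):
    """Only the nodes in the content property."""
    prop = CONTENT_PROP.get(node.cls)
    return list(node.props.get(prop, [])) if prop else []


def parent(node):
    """The origin, but only if this node sits in the origin's content property."""
    o = node.origin
    if o is not None and any(node is c for c in children(o)):
        return o
    return None


def walk(node, step):
    """Depth-first traversal using either subnodes or children as the step."""
    yield node
    for sub in step(node):
        yield from walk(sub, step)


section = Node("element", {"gi": "SECTION"})
attr = add(section, "attributes", Node("attribute", {"name": "ID", "value": "s1"}))
p1 = add(section, "content", Node("element", {"gi": "P"}))
add(p1, "content", Node("data", {"text": "A paragraph"}))
p2 = add(section, "content", Node("element", {"gi": "P"}))
add(p2, "content", Node("data", {"text": "A paragraph"}))

subnode_view = [n.cls for n in walk(section, subnodes)]
content_view = [n.cls for n in walk(section, children)]
print(subnode_view)
print(content_view)
```

The attribute node appears only in the subnode view: its origin is the section element, but it has no parent in the sense used above.
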
This distinction makes it easier to write processors, addresses, and queries that restrict themselves to the *semantic content* of the data, which is something most SGML processing needs to do. For example, in HyTime, tree addressing is normally applied to the content tree rooted at the document element--you don't want to have to account for all the attributes and other "non-content" properties of elements when calculating tree addresses. By having this fundamental distinction between content trees and other stuff, you don't have to build the distinction into the semantics of every address, query and processor. By the same token, when you want to use tree addressing to get at attributes, you can do so by asking for the subnode tree instead of the content tree.

Finally, the property set can define a particular node as being the "principal tree root", meaning that it is the root of the content tree that should be used by default if a specific tree root is not specified. In SGML, this is, of course, the document element, which is the root of the element content tree.]

> I have come to assume that the axiom of extensionality is not assumed
> to hold, so that distinct sets may have exactly the same elements. Or
> did I miss something?

Yes. The nodes in a grove are ordered, such that every node is distinguished by being enumerated within the node list property that contains it. Thus the three apparently identical paragraphs above are actually distinguished by their position within the document. In addition, the P element in the second section is distinguished from the first two by having a different origin property (its origin is the Section element within which it occurs).

--
<Address HyTime=bibloc homepage="http://www.drmacro.com">
W. Eliot Kimber, eliot@isogen.com
Senior SGML Consulting Engineer, Highland Consulting
2200 North Lamar Street, Suite 230, Dallas, Texas 75202
+1-214-953-0004 +1-214-953-3152 (fax)
http://www.isogen.com (work)</Address>
"Rats in the morning, rats in the afternoon...if they don't go away,
I'll be reducated soon..."
--Austin Lounge Lizards, "1984 Blues"
(http://www.webcom.com/~yeolde/all/lllhome.html)