[This local archive copy is from the official and canonical URL, http://www.prescod.net/groves/shorttut/; please refer to the canonical source document if possible.]
by Paul Prescod, ISOGEN Consulting Engineer
This paper is an introduction to the grove paradigm. It serves several purposes:
Please send me your comments on this document. It will eventually become an ISOGEN technical paper, but it is still rough.
In the early and mid-1990s, the ISO groups that were responsible for the SGML family of standards realized that they had a large problem. The people working on the DSSSL and HyTime standards found that they had slightly different ideas of the abstract structure of an SGML document. Understanding an SGML document's structure is easy for simple things, but there are many issues that are quite complex. For instance, it is not clear whether comments should be available for a DSSSL spec. to work on, or whether they should be addressable by hyperlinks. It isn't clear whether it should be possible to address every character, or only non-contiguous spans of characters. Should it be possible to address and process tokens in an attribute value or only character spans? Should it be possible to address markup declarations? XLink and XSL must solve all of the same issues.
The problem is that XML is defined in terms of its syntax, just as SGML was at first. Linking and processing are done in terms of some data model, not in terms of syntax. When you make a link between two elements, you are not linking in terms of the character positions of the start- and end-tags in an SGML or XML entity. You are linking in terms of abstract notions such as "element", "attributes" and "parse tree". The role of an XML parser is to throw away the syntax and rebuild the logical ("abstract") view. The role of a linking engine (such as a web browser) is to make links in terms of that logical view. The role of a stylesheet engine is to apply formatting in terms of that logical view.
Unless stylesheet languages, text databases, formatting engines and editors share a view, reliable processing will be unreliable and complicated. It is not very common for XML and SGML applications and toolkits to provide all of the information necessary for building many classes of sophisticated applications, such as editors. There is not even a standardized way for an toolkit to express what information from the SGML/XML document it will preserve. Even if two toolkits preserve exactly the same information, it is quite possible that they use different terminology to describe the information. In some cases, APIs might be identical except for trivial wording!
A related problem is how to address components of data types other than SGML and XML. For instance, how do you make a hyperlink to a particular frame in an MPEG movie, or a particular note in a midi sequence? How would you extract that information in a stylesheet (for instance for sequencing a multimedia hyperdocument). It makes no sense to address in terms of bytes, because often a single logical entity, like a frame, is actually spread across several bytes and they may not be contiguous. Addressing in terms of characters would make even less sense because MPEG movies and midi sequences are not character based. The web solves this problem by inventing a new "query language" (in the form of extensions to URLs) for each data type. This more or less works, but it leads to a proliferation of similar, but incompatible query languages doing the same basic thing, but with different underlying models.
The grove concept is intended to solve these problems. You can decide whether it succeeds for yourself.
The basic idea underlying groves is that notations like XML and SGML exist only as a syntax for some underlying data model. It is well understood, for instance, that SGML/XML elements form a tree. That is an abstract data model. The sequence of characters in an XML document is not literally a tree structure: it is the syntactic representation of a tree. The actual tree exists only as an abstraction in the head of the author, or in the data structures of the software that processes the document. XML is only interesting because it helps us to serialize these trees so that we can move them from place to place.
An XML document is much more than a simple tree. It has links built by ID/IDREF attributes. If the document uses XPointer, these links can use more complicated addressing mechanism. Knowing about these links is important for any tool that provides even primitive hypertext facilities. An XML document also has logical relationships between elements, element types and the declarations for those types. This might be important for a DTD-editing application to track. Furethmore, an XML document has markup-level details like ignorable whitespace and comments. Knowing about these details is important for editing tools. As you can see, the abstract data model for an XML document is actually quite intricate and complicated. It must address relationships, details of markup, and information about both the physical and logical makeup of the document!
The problem is that XML's data model is implicit. The XML specification alludes to it, but does not describe it. Surprisingly, the XML specification requires a processor to pass all whitespace to the "application", but does not require the processor to pass anything else on! This is not because whitespace is the only important thing: rather it is because everything else is underspecified. SGML was in the same sorry state of underspecification before the grove was developed. The grove solves this by providing a language for describing XML's abstract data model in a rigorous and complete way. You can think of the grove as a meta-"data model" for media applications. It is a data model for building data models. Or you could think of it as providing the low-level primitives for building high-level, media-specific models.
Groves are usually, but not always, tied to particular media types ("notations"). So groves for CGM documents would look fairly different from groves for XML documents. The terminology and semantics of these two specifications are quite different, so you would expect their APIs and query languages to also be fairly different. The grove model is designed to allow them to be exactly as different as they need to be, and no more! What that means is that the basic concepts are the same, but that every media type ("notation") defines its own vocabulary of "properties" in terms of the basic concepts underlying the notation. We call these vocabularies "property sets." In a grove-based view of the world, an XML document is a collection of hundreds of properties, all drawn from the XML property set. This is analogous to the way that a valid XML document is a collection of element types, all drawn from some DTD.
All properties are held in containers called nodes. Nodes represent everything in an XML document: elements, attributes, every significant character, all insignificant whitespace, etc. Groves are so complete that given a complete implementation (GroveMinder is almost complete, Jade is not) HyTime can make a hypertext link to the keyword "#REQUIRED" in an attribute list declaration and DSSSL stylesheets can (in theory) vary their formatting on the amount of whitespace between attribute values in a start-tag. Of course nobody is likely to go that far, but the point is that all of that information is available and addressable. For someone creating (for example) an XML editing or maintenance system, these issues could be important. Consider the case where the only difference between a checked-in document and the version in the archive is insignificant whitespace. A smart repository might choose not to increment the version number.
System designers can ask a grove buider to trim nodes that they do not need from the grove using a "grove plan". This means that your applications do not need to keep track of all of that information if you are not using it. Limited grove builders (such as the free Jade software) can describe their capabilities in terms of grove plans. Two products that claim to support the same grove plan should build identical groves for a particular document.
As an example: one sort of node in an XML document would be an "Element" node. Examples of its properties include its generic identifier (type name), attributes and content. Examples of information that most applications would "trim" out of the grove might include the markup details of the elements' start- and end-tags, entity starts and ends and any ignored whitespace at the start and end of the element.
Groves can also contain so-called "emergent properties." An emergent property is one that is not directly obvious from the syntax, but emerges when a document is processed as a whole. For example, the list of elements with unqiue IDs in a document is only available when the document has been completely processed. In the SGML property set, this list is stored in a property called "elements" on the "SGML Document" node. An emergent property is any property that is computed based on information from various parts of the document.
Another emergent property would be the logical relationship between elements and their element types nodes. If your grove plan contains DTD information, then you can ask an element for its "element type" and get back a node with information about the content model of the element type, allowed attributes, tag ommissability and so forth. Note that even the relationship between element types and their allowed attributes is an emergent property because in SGML/XML, attributes are specified separate from the elements that they apply to.
Property sets are defined in documents that conform to the "propset" DTD. You can think of these documents as simple schemas for property sets. They can specify that particular properties must contain particular types of values (integers, strings, nodes, lists of nodes). They can specify that some properties are so-called "sub-nodal" properties, which means that in the logical tree, the node with the property logically "owns" the node that is the value of the property. For example, elements in the SGML property set have a "subnodal" property called "attributes." This means that elements "own" attributes.
The property set definition language is defined in Annex A of the HyTime specification. As a schema language, it is not as powerful as those used for databases (SQL) or CAD systems (such as EXPRESS), but it is sufficient for our current needs. In the future they may be augmented or replaced by something like EXPRESS.
The most important existing property set is the SGML property set. Although the SGML property set should theoretically be specified in the SGML standard, it is actually described in the HyTime standard. This is due to the fact that the SGML standard predates the grove concept! Nevertheless, the SGML property set is quite complete, robust and well thought out. The SGML property set can be used for XML documents, but it may sometimes be worth creating an XML property set in terms of the terminology and feature set of the XML specification.
Another important property set is the HyTime property set. This provides a data model for the HyTime links in a document. It should be possible to use the HyTime property set to describe XLink links. SGML groves are built from single documents, but HyTime groves are built from "hyperdocuments" constructed from many interlinked SGML documents. Nodes in the HyTime grove point back to the SGML document that contained the HyTime construct.
Other property sets include the Plain Text property set, which has only two classes, one for plaintext documents and one for each data character and the Data Tokenizer property set which can be used to break text up into tokens by separating on whitespace or other characters. These property sets are ISO standards. It is quite likely that other ISO standards will incorporate property sets in the future. Work is underway to create property sets for EXPRESS and STEP data and for the Computer Graphic Metafile(CGM).
ISO is not the only organization that can create property sets. Private individuals such as Sam Hunting and corporations such as TechnoTeacher have also created property sets for everything from schema languages to contract law.
The two most fundamental concepts in the property set paradigm are nodes and properties. Technically speaking, a node is just an "ordered set of properties, representing a single object." In a grove conforming to the SGML property set, examples of objects include elements, start-tags, generic identifiers or anything else in the property set. In a HyTime grove, examples of objects include links, anchors, hyperdocuments etc.
A property is a combination of a name and a value. Conceptually, this is similar to an attribute in XML or SGML. It is important not to confuse properties with XML attributes, however. The concepts are similar because the basic idea of name/value pairs is fundamental to information: think of a phone book or dictionary. Objects in OOP programming languages are also sets of name/value pairs. But the properties in an SGML grove are defined in the SGML property set, not in some particular DTD.
Every node conforms to some node class. Node classes are defined in the property set. All nodes of the same class have exactly the same properties, in exactly the same order. The class also restricts the possible types of the each property.
Properties on nodes appear in the same order as their definition in the property set. Even if an implementation uses an unordered storage model (e.g. a dictionary or hashtable), it can present an ordered view of the properties by checking for the correct order in the property set. In the SGML property set, the first property of an element is its generic identifier (element type name). The second property is its unique identifier (ID), the third property is attributes. By convention, these properties are ordered so that the most commonly used, widely understood properties are first and the less common used or understood properties are later. Thanks to this organization, and to places in the propset DTD that allow commentary, a property set definition document can serve as documentation for the property set.
The easiest way to read a property set is with the output of TechnoTeacher's PropGrinder program. This program reads a property set and generates interlinked HTML pages for the various parts. The propset DTD is very compact and uses very short element type and attribute names. The PropGrinder program turns this terse document into something more readable and navigable.
As you read this document, you should follow along with PropGrinder's description of a simplified SGML property set at the HyTime Users's Group Web Site. Of course, you need to open that document in a separate window.
The so-called "SGML-ESIS" set is simplified in that it only supports the most commonly used parts of the property set (often termed the "ESIS"). It is simplified though a "grove plan." We will discuss grove plans more later.
A node class in a property set acts as a schema for nodes of that class. Every node must have a value for every property declared in its class. We say that the node "exhibits" that property. It is possible for it to exhibit a "null" value, just as in relational database or object oriented theory, but it must always have the property nevertheless. As an interesting diversion, this comparison to relational databases provides a hint of an implementation strategy for property sets. Each class can have a table with as many columns as there are properties.
An example of a node class is the Element class. It will help you to understand what follows if you look at the PropGrinder output for this element while you read. To get there, click on the link called "Classes" and then click on the class called "Element". Once you are there, you will see a display dedicated to information about the element node class. At the top is information about the node class. If you have frames turned on, the bottom left lists the node classes' properties. The bottom right displays information about a particular property.
Just as relational tables have a particular type (string, number, date, etc.), property values have types. The list of these types is in section A.4.1.1 of the HyTime specification. Some of them are very simple: "char" is a character. The "char" property of the "datachar" node is an example of a character property. Character set issues are beyond the scope of this tutorial. "String" is an ordered list of zero or more characters and "strlist", is an ordered list of zero or more strings.
The element's "GI" (generic identifier) property is a string property. Click on the string "GI" in the bottom left to see information about the property in the bottom right. Propgrinder describes the full name of the property: "Generic identifier". It also says that the property is in the default grove plan which means that it should be provided by any grove builder that has not been asked not to provide it. Because it is a string, it gets a "string normalization rule" which basically describes how strings are normalized by the parser.
Its "verify type" is "other." The HyTime specification has this to say about the "verify type": "The attribute verify type (vrfytype) is used by the DSSSL transformation language. It is fully described in the DSSSL standard." The verify type is otherwise beyond the scope of this tutorial. Finally, the property set tells us what clause of the SGML specification defines the concept of a GI (clause 7.8, paragraph 1) and gives us a short diescription of the node: "Generic identifier (element type name) of element."
Other types of properties include "integer", for integral numbers, "intlist", for ordered lists of integers and "boolean", for true/false values.
A slightly more complex type of value is an "enum" or enumeration. This is similar to an enumeration in a programming language or in an SGML/XML attribute value. The property set designer can specify a list of named, valid values for the property. For instance for SGML entity types, the enumerated values are "text", "cdata", "ndata", "sdata", "subdoc" and "pi". You can see this by looking at the "Entity Type" property of the "Entity" class. To do this, click on "Classes" (at the top), then "Entity" (in the list) and then "Entity Type" (in the properties list in the bottom left corner).
There are also node value types called "component name" and "component name lists." A component name represents the name of some grove property, class name or enumerated value. Component names are not just strings, although some grove-based APIs may treat them as strings. You can think of component names as strings that are known to the grove processor "in advance" because they are from the property set. Ordinary strings are not known in advance. The grove makes the distinction because strings that are known in advance may be "compiled" (or, more formally, "interned") into integers and referred to more efficiently.
Another type of value is a nodal value. This is conceptually a "pointer to" or "reference to" another node. You can also have lists of these references to nodes called "node lists." For instance, HyTime links would point to other nodes through nodal values. In fact, the entire grove is constructed through nodal (and node list) values. In the SGML property set, the "SGML Document Node" has a property called "governing doctype" that refers to the DTD that is in use. The type of that property is "document type." It also has a property that refers to the root element of the document, called the "document element" property. The class of this element is just "element". All element nodes (including the document element) point to their children (elements, characters, etc.) through a node list property called "content". This continues down to every data character in the content of an element.
You can follow this path by starting at the SGML document node. Do this by clicking on the "Classes" link at the top and "SgmlDocument" in the class list. You can see the GoverningDoctype and DocumentElement properties I described above. If you click on the DocumentElement property, you will see that it allows nodes of type Element. If you click on the word Element, the display will change to present information about the Element class. It should look familiar. We've been here before. From there, you can drill down into the valid content for an element by clicking on the Content property in the bottom left corner. From there you can see that the Content property is a node list which allows "Data Char" nodes, along with elements, external data, processing instructions and sdata.
Attributes are slightly different. The attributes property of Element nodes is of type named node list. A named node list is like a node list in that it is an ordered list of nodes. But it is more than an ordinary node list. Each node in the named node list is assigned a name based on some property of the node. For example, the attributes property of element nodes can contain only attribute assignment nodes. Each of these nodes must have a "name" property.
You can see that this property is the "name property" of the "attributes" named node list. First go to the Element class page as we did before. Then click on the word "Attributes" in the properties list for the class "element." In the bottom right hand corner, there should be a box labelled "Allowed Classes" with a small table (with perhaps only one item in it). On the left side is the name of a class that is allowed, and on the right side is the "name property" of that class. The word "name" should appear because "name" happens to be the "name property" for the "attribute assignment" nodes in the "attributes" named node list of the "elements node." Click on the word "name" to see the "attribute assignment" class description. As a shortcut, we can say that the attrirubes named node list is "indexed" by the attributes' names. This is another hint
As you can see, the grove can be thought of as a sort of "parse tree" with nodes containing nodes. But not every property expresses a logical "contains" relationship. Some properties express other sorts of relationships. For instance, an entity reference node must point to its entity definition node and similiarly an element node must point to its element type node. There is no logical container relationship there. We refer to properties that express a container relationship as "sub-nodal" properties. We refer to properties that express any other relationship as an "internal reference" ("irefnode") or "unrestrained reference" ("urefnode"). A node referenced through an irefnode property must be in the same grove as the referencing (property containing) node. Unrestrained referenced nodes may be in the same grove or another one.
It is quite common to want to do something with each node in a grove. For instance you might want a printed representation of a grove, or you might want to store it in a database. It is easy to visit each node because every node except the grove's root node occurs once and only once as a subnode of some other node. The other node is referred to as the subnode's origin. Every node has an "origin" property that allows us to find the nodes origin. The grove root's "origin" property is null.
"Origin" is the first property we have discussed that is common to every single grove, no matter what its property set. These common properties are called "intrinsic properties." We will discuss these later.
In many media types, there is a distinction between content data and metadata. Arguably, in SGML and XML, the DTD is metadata, attributes are metadata and character data is "real" data. The property set paradigm allows the property set definer to specify the difference between metadata and data. Every node is allowed to have one property that it distinguishes as being its "children" property. The name of the property does not have to be "children". For SGML/XML elements, it is just "content." The children property (whatever its name) is considered to be the data. All other properties are not. Presumably they are metadata. There is only one children property allowed per node.
Imagine we are writing software that works with any sort of grove. We might want to know what a node's children property is. One way to do that would be to read the property set document (it is an SGML document, after all). That is rather a hassle, though. It is possible to ask a node what its children property's name is, rather than reading the property set document. Every node exhibits a component name-valued property called "Children Property Name". If the node class has a children property, the value of the children property name property must be the name of some other property that the class exhibits. For instance, element nodes would have a "children property name" property with a value of "content" because the element's content property is its children property.
Intuitively, this corresponds to the tree that you would draw of of an SGML document: the logical parse tree. It includes elements and their contents, but not things such as attributes, the DTD, start- and end-tags, etc. This tree also corresponds to the logical tree that almost every existing stylesheet language works upon. By default, stylesheet languages typically look at each element and output its data content, ignoring its attributes. They make the same distinction between metadata and content data.
Of course, any particular DTD designer might encode something she logically considers metadata as an element instead of an attribute: but they will find that the grove and most SGML/XML software (whether grove-based or not) will not really support them in that decision. SGML/XML practitioners have various techniques for working around this: for instance, if your application does not need the distinction between data and metadata, it can treat ordinary subnode properties and children properties the same.
A similar, related property is the data property. Some nodes carry content data but have no subnodes. The node might carry its data in a character or string property instead of in subnodes. Obviously a data character node cannot have some other node as its subnode and so on infinitely! Instead of having a children property, these nodes have a data property. The data property is specified with a data property name intrinsic property. Every node class can have either a data property or a children property, but not both. Either one can be referred to as the content property. For instance, the notation class does not have either, so it has no content property. On the other hand, the Element class has "Content" as its content property. You can see this by looking at the class description for the Element class output by PropGrinder.
Using the content property, it is possible to extract the data of a node automatically. The data of a node with a data property is the value of that property. Of course this data can only be a character or string. The data of a node with a children property is the concatenation of the data produced by each of the node's children perhaps separated by a data separator. A data separator is specified through a Data Separator property which is found based on a Data Separator Property Name property.
There are certain properties that every node exhibits, no matter what its class or its property set. These are called the intrinsic properties. We do not repetitively define these in a property set definition document, so you will not find them in the definitions for every node.
We have already discussed the Children Property Name, Data Property Name and Data Separator Property Name properties. These say which property (if any) should be considered to be the content property of a node and if the node has children, what separator should be placed between the children's data when they are concatenated. Since a node can have at most one of these "content" properties, either the children or data property is required to be null, and in some cases they may both be null. These properties are provided to make the property set "reflexive". The goal is to be able to learn things about a node's definition from its properties. The data separator property works with the data property name to allow us to properly construct the data of the node.
Another important intrinsic property is All Property Names. This is also a property that reflects the nodes definition. By asking a node for all of its property names, we can write software that can work with groves without knowing about particular property sets. For example, an application could store nodes in a database or transmit them over the Web without reading their property set definition document or having hard-coded information about them. A similar property is Subnode Property Names which is a list of the names of all of the subnode properties of the node.
Also related to reflexiveness is the Class Name property. Given a node, it is possible to ask it for its class name and base special processing on that name. For example, a class name could be used to look up the documentation for a class in the property set definition document.
Other intrinsic properties help with navigation around the grove. Every node has a Grove Root property that describes how to get to the root node of the current grove. Of course you could find this property in code by looking at the origin of the node, and the origin of its origin, and so on. Eventually when you find a node with a null origin, you have found the grove root. The grove root property makes this easier, however.
Another navigational property is the Origin to Subnode Relationship Property Name property. Every node in the grove except the root has an origin. The node must be referred to in some sub-nodal property of its origin. That property is known as the Origin to Subnode Relationship property name. This property allows you to navigate up to the origin node and then back down to the subnode. If the Origin to Subnode Relationship is a children property, then we say that the subnode is a child and the origin node is a parent. If a node is a child, we can get its parent node with the intrinsic Parent property.
The final intrinsic property is the Tree Root property. This is equivalent to walking up from node to node as long as each node has a parent. The node that has no parent is the tree root. It might not be the grove root, however. If the node has an origin, then it is not the grove root, but if the origin is not a parent then the node is a tree root.
The HyTime specification defines a grove plan as "a specification of what modules, classes, and properties to include in a grove. Grove plans are used both to construct groves and to view existing groves." Grove plans are defined through "grove plan" elements. These elements are defined in section 18.104.22.168 of the HyTime specification. They can include or exclude classes and properties.
Sets of property set components may be grouped together so that they can be include or excluded from a grove plan at once. The grove plan element can include or exclude them all at once.
The SGML property set defines the results of an SGML parse. It serves as a data model for hypertext linking and as a basis for SGML creation and management APIs. This section will describe the major node types in the SGML property set.
The root node of an SGML grove is always an SGML document node. SGML Document nodes never occur as subnodes of other nodes. The SGML document node has among its subnodes the document element, the DTD and various sorts of document-global information. Some important properties of the SGML document node are:
Most SGML processing is controlled by elements and their generic identifiers (element type names). Element nodes can occur as the document element or in the content of some other element.
Attributes are very common in SGML/XML documents. They allow authors to attach extra information (strings) to elements. The nodes that represent attribute assignments are different from those that represent attribute definitions (in the DTD). Here are some of the important properties of attribute assignment nodes:
Attribute value token nodes represent the value of tokenized attributes. If you want to work with each token of a tokenized attribute value individually, you should use attribute value tokens instead of asking for the data of the attribute assignment.
You can find an elements with a certain ID by asking for the elements property of the SGML document node. In Python, given any node, you could get the element with the ID foo like this:
node.GroveRoot.Elements["FOO"]. For instance to resolve an IDREF attribute named REF:
You can find an entity by name using the same procedure as for elements, but use the "Entities" property of the SGML Document node instead of the "Elements" property:
The defintion of the grove paradigm is the Property Set Definition Requirements annex of the HyTime specification. It is quite readable, especially if you have already read this tutorial. You can also contact me with questions, but I can't guarantee rapid response. For longer engagements, we can arrange a consulting contract. My employer, ISOGEN, is the leading supplier of grove-based document management know-how.