[This local archive copy is from the official and canonical URL, http://www.prescod.net/groves/shorttut/; please refer to the canonical source document if possible.]
by Paul Prescod, ISOGEN Consulting Engineer
This paper is a high-level introduction to the grove paradigm. Just as SGML was a hidden jewel buried among the ISO standards for screwdriver heads, groves are another well-kept secret. The time has come to make "groves for the Web". This document should be relevant to the people who would do the specifying and coding to make groves available on the Web, but also to technically-oriented managers who are not interested in the fine details.
Please send me your comments on this document. It will eventually become an ISOGEN technical paper, but it is still rough.
In the early and mid-1990s, the ISO groups that were responsible for the SGML family of standards realized that they had a large problem. The people working on the DSSSL and HyTime standards found that they had slightly different ideas of the abstract structure of an SGML document. Understanding an SGML document's structure is easy for simple things, but there are many issues that are quite complex. For instance, it is not clear whether comments should be available for a DSSSL spec. to work on, or whether they should be addressable by hyperlinks. It isn't clear whether it should be possible to address every character, or only non-contiguous spans of characters. Should it be possible to address and process tokens in an attribute value or only character spans? Should it be possible to address markup declarations? XLink and XSL must solve all of the same issues.
Although this paper will discuss many problem domains, the reader should keep in mind that addressing is the central one. If you cannot address information (e.g. through a URL) then you cannot do anything else you need to do with it, such as retrieve it, bind methods to it, attach metadata to it, apply access control lists to it, render it, work with it in a programming language and so forth. Addressing is the key. Value follows naturally and immediately.
Addressing into XML (and other data formats) is ill-defined because the XML specification speaks of the syntax of the XML language, not the abstract, addressable objects encoded in the document. Linking and processing are done in terms of some data model, not in terms of syntax. When you make a link between two elements, you are not linking in terms of the character positions of the start- and end-tags in an SGML or XML entity. You are linking in terms of abstract notions such as "element", "attributes" and "parse tree". The role of an XML parser is to throw away the syntax and rebuild the logical ("abstract") view. The role of a linking engine (such as a web browser) is to make links in terms of that logical view. The role of a stylesheet engine is to apply formatting in terms of that logical view.
Unless stylesheet languages, text databases, formatting engines and editors share a view, processing will be unreliable and complicated. It is not very common for XML and SGML applications and toolkits to provide all of the information necessary for building many classes of sophisticated applications, such as editors. There is not even a standardized way for a toolkit to express what information from the SGML/XML document it will preserve. Even if two toolkits preserve exactly the same information, it is quite possible that they use different terminology to describe the information. In some cases, APIs might be identical except that they use different structures to organize the information! But those one or two features could make navigating the APIs very different.
In the software engineering world we have a technique for avoiding this sort of problem: modelling. Using languages like the Unified Modelling Language (UML) we can build sophisticated, intricate models of the world that can be independently implemented and yet interoperable. I can hand a model of a human resources application to a developer on the other side of the planet and we can build logically compatible applications. Of course UML is at a very high level. The precise expression of an object in a particular programming language or system is not fixed by UML. The UML is a mathematical expression of the entities and relationships in a problem domain. It doesn't usually translate directly into code or APIs. That is why we also have to use more concrete object description languages such as IDL, ODL and STEP Express.
The W3C has partially addressed this situation with a specification called the Document Object Model (DOM). Unfortunately the DOM is not really an object model in the abstract sense. It is rather just a collection of IDL interfaces and some descriptions of how they relate. This is different from an abstract object model because it is too flexible in some places and not flexible enough in others.
The DOM is too flexible in that it is not rigorous enough to be a basis for addressing. For instance the DOM says that a string of four characters could be broken up into multiple text nodes or treated as a single one. If we describe addresses in terms of DOM text nodes, those addresses will be interpreted differently by various DOM implementations. This is one reason that XPointer and XSL are not defined in terms of the DOM. This weakness of the DOM is fatal for using it for addressing but it is also annoying for programmers. In some cases they must write special code to work with documents that have different text breaking algorithms because the DOM has given implementors too much flexibility here. It puts their ease of implementation above the ease of coding for DOM users.
In other ways the DOM is not flexible enough. One important weakness is that it is defined in IDL, which does not permit much variation in language mappings and bindings. We have found this very limiting in the Python and Perl worlds. With these high level languages there are more convenient ways of mapping the high level XML concepts into APIs than the ways dictated by CORBA. If we use these ways instead of the DOM ways, however, our APIs are conformant to the DOM only in spirit, not in terms of the formal detail of the specification.
The DOM has a more important inflexibility. It would be useful for the programmer using the DOM to be able to define whether all adjacent text nodes are merged or not. There is a "normalize" method that attempts to provide this feature, but that method actually modifies the tree. All viewers must see the same view. Another useful view is one in which every character is a separate node. That view allows us to address individual characters very easily. Another view might provide DTD information for a document. Yet another view would provide linking information. Still another view would attach RDF properties to the DOM.
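The text-node point can be seen in miniature with Python's standard xml.dom.minidom module. That module is not mentioned in this paper; it is used here only as a convenient DOM implementation, and the function name split_then_normalize is hypothetical. The sketch stores the same four characters as two text nodes and then calls normalize(), which merges them by changing the tree itself rather than by offering a second view of it.

```python
from xml.dom import minidom


def split_then_normalize():
    """Build <p> with two adjacent text nodes, then call normalize()."""
    doc = minidom.parseString("<p/>")
    p = doc.documentElement
    # The same four characters, stored as two text nodes instead of one.
    p.appendChild(doc.createTextNode("ab"))
    p.appendChild(doc.createTextNode("cd"))
    count_before = len(p.childNodes)
    # normalize() merges adjacent text nodes by rewriting the tree, so every
    # holder of a reference to this tree sees the merged result.
    p.normalize()
    count_after = len(p.childNodes)
    return count_before, count_after, p.firstChild.data
```

An address such as "the second text node of p" would resolve before the call and fail after it, which is the difficulty with addressing in terms of DOM text nodes.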
We can also make views that are simpler than the default DOM view. We could have a view that got rid of CDATA nodes and treated them just as text. Another view might remove processing instructions based on the principle that many applications do not use them. It would also be very nice to be able to remove "insignificant" whitespace from a view. The W3C is working on a subset of XML to make XML easier to process for parsers but there is no such spec to make simpler DOMs for application writers.
Let's take this back to the addressing realm for a second. Given all of these views of a document, we could do things like query for the node with the "author" RDF property with value "Paul", or for the node that is referenced by a particular hyperlink, or for the third character of an element and so forth.
There is an important truth here. Every time we create a new specification built on XML, we implicitly define new properties that should be attached to the nodes: almost a whole new data model! Consider a document type based not only on XML but also on XLink, namespaces and RDF. That document has many different views. Here are some obvious ones:
Let's step back for a minute again. If we can make all of these views available to the programmer in some simple, consistent way, then we could surely make them all available to people doing querying and addressing also. That means that we could make a query language that could do queries based on constructs from all four levels! We could also easily define query languages that were specialized and optimized for a particular level.
The way we currently handle this is through different "levels" of the DOM. The second one is being worked on right now. These levels tend to lag behind the specifications that they are supposed to work with by months or years. There is a DOM for XML, HTML and CSS, but nothing for namespaces, RDF, XLink, XSL queries or XHTML. There is a single group within the W3C that will be responsible for building all of these "levels" of the DOM. This group of intelligent, well-meaning people is the most fundamental bottleneck in the standards world today.
Even if all of the DOM people gave up their day jobs and became full-time DOM builders they could never keep up with the amount of innovation occurring within the W3C. Consider then, that the problem is in no way limited to the W3C. People are building little XML-based languages with their own data models all over the Web. A central API bottleneck is not inconvenient: it is impossible. The DOM cannot be a universal API for all XML-based languages.
In the ISO world we solve this problem by farming out the definition of data models. A "property set" is a formal model for a view of an XML document. A property set is half way between an abstract, unimplementable UML data model and a narrowly defined IDL definition. It speaks in terms of the higher level concepts that are the basis of hypermedia and so can be implemented conveniently in high level programming languages.
The XML Information Set project is similar to the DOM except that it works in terms of abstractions instead of APIs. That is an important first step. But the Information Set is designed only for a single view of XML with certain optional features. It does not seem at this time that it will be possible for "end-users" (programmers and query-writers) to tweak the views. It also does not seem that the model is designed to be extensible to completely new views. In other words it provides the very bottom layer but does not define the infrastructure to build the upper ones.
The important thing about property sets is that they embed and embody the requirements necessary for a data model to be useful in a hypermedia context. That means that every node in the grove is addressable. It is easy to construct an address for any given node (for instance the character under a mouse click) or node list (e.g. a selected list of characters).
Now we get to the really exciting thing about property sets. You can build property sets for views of XML documents. Those are extremely useful and powerful. Even more powerful, though, are property sets for things that are not even XML. SQL databases and OLE objects can have property sets. LaTeX files can have property sets. People have defined experimental property sets for CSS, CGM and for something as abstract as legal documents. After all, a property set is just a simple data model. You could define UML models for all of those types. Defining a property set is no harder.
But property sets have a huge benefit over UML: once you define a property set for a data object type, that data object becomes addressable. This means that every subcomponent of every data object in an enterprise is potentially addressable. The important point is that you do not have to convert all of your data resources into XML or HTML to make them addressable. You may need to turn them into XML or HTML to render or transfer them between machines, but there are many other things that we need to be able to do with addressable resources. We can attach access control lists to them, make hyperlinks to them, attach metadata to them and so forth.
Some "XML people" have this idea already but they express it in terms of "building a DOM" over some non-XML resource. The idea is right but the expression of the idea is wrong. The logic goes this way: We want an addressable data model for the resource. XML has a data model. Therefore let's use the XML data model for the resource. This model seems logical but it is inefficient both in terms of computer time and programmer time.
If the underlying object is a relational database then it makes no sense to take numbers (for example) and encode them as strings so that the client application can unencode them back to numbers. Similarly, it makes no sense to turn a database record into an XML element so that the final application can think of it as a record again.
If all you need to do is address a database record then what you want to do is the minimum required to turn a database record into something addressable. The grove model is designed so that defining a property set for the database is the minimum you have to do. In this case you can forget about XML altogether!
In buzzword-speak this is "addressing the enterprise." Every data object in an organization from the smallest non-profit to the largest multinational can be addressed through a single data model and query language. You might also think of these as meta-models and meta-query languages in that the grove model and its associated query language give you a framework for defining the details of more precise models and richer query languages.
Let me say again that rendering the document is another matter altogether. Given an address, the easiest way to render the object might be via XML. For a database record this might be the case. For a slide within a PowerPoint document, however, the easiest way to render it might be through OLE. Addressing is separate from rendering. Groves allow you to say that you want to see the slide. OLE or XML/XSL might provide the technology that you need to actually see the slide.
Without groves, hypermedia addressing is very poorly defined. For instance, how do you, today, make a hyperlink to a particular frame in an MPEG movie, or a particular note in a MIDI sequence? How would you extract that information in a stylesheet (for instance for sequencing a multimedia hyperdocument)? It makes no sense to address in terms of bytes, because often a single logical entity, like a frame, is actually spread across several bytes and they may not be contiguous. Addressing in terms of characters would make even less sense because MPEG movies and MIDI sequences are not character based. The web solves this problem by inventing a new "query language" (in the form of extensions to URLs called "fragment identifiers") for each data type. This more or less works, but it leads to a proliferation of similar, but incompatible query languages doing the same basic thing. These languages have different syntax and underlying models.
This brings us to the next point: implementation. Under today's W3C way of doing things you would implement a hypermedia browser (e.g. SMIL) by hard-coding support for each different type of query for each type of playable object. If resources hyperlinked to each other through these fragment identifiers, the implementation engine would have to implement separate query languages for each fragment identifier type. That is an annoying waste of time.
Consider the issue of metadata attached to parts of media objects through links. For instance I might add a title to an MPEG frame so that I could locate it later. Or I could add a pop-up-video style annotation.
Usually this would be implemented as some sort of on-disk or in-memory database. In one column of a record you would have the properties that you want to attach (expressed somehow). On the other side you need to have things to attach them to. We need a generic term for "things that you can address within media objects." Generically, we could call these "nodes" and "node lists." As soon as you make that leap to describing the targets of references generically, regardless of media type, you have essentially reinvented groves. It follows that standards like RDF implicitly depend upon a (currently underdefined) concept similar to groves. Instead of reinventing them, however, you have the option of using an international standard that specifies them! I hope that one day there will also be a W3C standard that does something similar.
Using groves we can implement a common infrastructure for working with nodes and attaching properties to them, no matter what the media type. In fact there is already a company working on a product that implements this infrastructure. The product is called GroveMinder from TechnoTeacher.
The rest of this document outlines the solution to these problems embraced by XML's parent standard SGML. This solution is known as the "grove paradigm." It should inform a movement to bring "universal addressability" to the Web. Unlike SGML, there are not a lot of extra features in the grove model that make it hard to use. It is relatively hard to implement in its entirety, but it is not hard to use. The DOM puts the complexity burden on users. Groves are easy to use but they require a smarter underlying implementation.
The basic idea underlying groves is that notations like XML and SGML exist only as a syntax for some underlying data model. It is well understood, for instance, that SGML/XML elements form a tree. That is an abstract data model. The sequence of characters in an XML document is not literally a tree structure: it is the syntactic representation of a tree. The actual tree exists only as an abstraction in the head of the author, or in the data structures of the software that processes the document. XML is only interesting because it helps us to serialize these trees so that we can move them from place to place.
An XML document is much more than a simple tree. It has links built by ID/IDREF attributes. If the document uses XPointer, these links can use more complicated addressing mechanisms. Knowing about these links is important for any tool that provides even primitive hypertext facilities. An XML document also has logical relationships between elements, element types and the declarations for those types. This might be important for a DTD-editing application to track. Furthermore, an XML document has markup-level details like ignorable whitespace and comments. Knowing about these details is important for editing tools. As you can see, the abstract data model for an XML document is actually quite intricate and complicated. It must address relationships, details of markup, and information about both the physical and logical makeup of the document!
The problem is that XML's data model is implicit. The XML specification alludes to it, but does not describe it. Surprisingly, the XML specification requires a processor to pass all whitespace to the "application", but does not require the processor to pass anything else on! This is not because whitespace is the only important thing: rather it is because everything else is underspecified. SGML was in the same sorry state of underspecification before the grove was developed. The grove solves this by providing a language for describing XML's abstract data model in a rigorous and complete way. You can think of the grove as a meta-"data model" for media applications. It is a data model for building data models. Or you could think of it as providing the low-level primitives for building high-level, media-specific models.
Groves are usually, but not always, tied to particular media types ("notations"). So groves for CGM documents would look fairly different from groves for XML documents. The terminology and semantics of these two specifications are quite different, so you would expect their APIs and query languages to also be fairly different. The grove model is designed to allow them to be exactly as different as they need to be, and no more! What that means is that the basic concepts are the same, but that every media type ("notation") defines its own vocabulary of "properties" in terms of the basic concepts underlying the notation. We call these vocabularies "property sets." In a grove-based view of the world, an XML document is a collection of hundreds of properties, all drawn from the XML property set. This is analogous to the way that a valid XML document is a collection of element types, all drawn from some DTD.
All properties are held in containers called nodes. Nodes represent everything in an XML document: elements, attributes, every significant character, all insignificant whitespace, etc. Groves are so complete that given a complete implementation (GroveMinder is almost complete, Jade is not) HyTime can make a hypertext link to the keyword "#REQUIRED" in an attribute list declaration and DSSSL stylesheets can (in theory) vary their formatting on the amount of whitespace between attribute values in a start-tag. Of course nobody is likely to go that far, but the point is that all of that information is available and addressable. For someone creating (for example) an XML editing or maintenance system, these issues could be important. Consider the case where the only difference between a checked-in document and the version in the archive is insignificant whitespace. A smart repository might choose not to increment the version number.
System designers can ask a grove builder to trim nodes that they do not need from the grove using a "grove plan". This means that your applications do not need to keep track of all of that information if you are not using it. Limited grove builders (such as the free Jade software) can describe their capabilities in terms of grove plans. Two products that claim to support the same grove plan should build identical groves for a particular document.
As an example: one sort of node in an XML document would be an "Element" node. Examples of its properties include its generic identifier (type name), attributes and content. Examples of information that most applications would "trim" out of the grove might include the markup details of the elements' start- and end-tags, entity starts and ends and any ignored whitespace at the start and end of the element.
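The trimming idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the plan is reduced to a set of property names, and the names used (gi, attributes, content, start_tag_markup, end_tag_markup) and the function build_element_node are hypothetical stand-ins, not the names or mechanism used by the SGML property set or by any real grove builder.

```python
# Hypothetical, heavily simplified stand-in for a grove plan: the set of
# property names that the grove builder is asked to keep.
ESIS_LIKE_PLAN = {"gi", "attributes", "content"}


def build_element_node(all_properties, plan):
    """Return only the properties the plan asks for, in their original order."""
    return {name: value for name, value in all_properties.items() if name in plan}


# Everything a complete builder could know about one element (illustrative).
full = {
    "gi": "para",
    "attributes": {"id": "p1"},
    "content": ["Hello"],
    "start_tag_markup": "<para id='p1'>",  # markup detail most plans trim
    "end_tag_markup": "</para>",           # markup detail most plans trim
}

trimmed = build_element_node(full, ESIS_LIKE_PLAN)
```

Two builders given the same plan and the same document should produce equal results, which in this sketch is simply equality of the trimmed dictionaries.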
Groves can also contain so-called "emergent properties." An emergent property is one that is not directly obvious from the syntax, but emerges when a document is processed as a whole. For example, the list of elements with unique IDs in a document is only available when the document has been completely processed. In the SGML property set, this list is stored in a property called "elements" on the "SGML Document" node. An emergent property is any property that is computed based on information from various parts of the document.
Another emergent property would be the logical relationship between elements and their element type nodes. If your grove plan contains DTD information, then you can ask an element for its "element type" and get back a node with information about the content model of the element type, allowed attributes, tag omissibility and so forth. Note that even the relationship between element types and their allowed attributes is an emergent property because in SGML/XML, attributes are specified separately from the elements that they apply to.
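As a rough illustration of an emergent property, the following Python sketch collects every element that carries an ID by walking a whole parsed document. It uses the standard xml.etree.ElementTree module, which is not a grove builder. It also assumes that IDs live in an attribute literally named "id", whereas in SGML/XML the ID attribute is whatever the DTD declares. The function name elements_with_ids is hypothetical.

```python
import xml.etree.ElementTree as ET


def elements_with_ids(xml_text, id_attribute="id"):
    """Map each ID value to the type name of the element that carries it."""
    root = ET.fromstring(xml_text)
    # root.iter() visits every element in document order, so the table is
    # only complete once the entire document has been parsed.
    return {
        element.get(id_attribute): element.tag
        for element in root.iter()
        if element.get(id_attribute) is not None
    }
```

No single start-tag contains this table; it exists only as a result of processing the document as a whole.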
Property sets are defined in documents that conform to the "propset" DTD. You can think of these documents as simple schemas for property sets. They can specify that particular properties must contain particular types of values (integers, strings, nodes, lists of nodes). They can specify that some properties are so-called "sub-nodal" properties, which means that in the logical tree, the node with the property logically "owns" the node that is the value of the property. For example, elements in the SGML property set have a "subnodal" property called "attributes." This means that elements "own" attributes.
The property set definition language is defined in Annex A of the HyTime specification. As a schema language, it is not as powerful as those used for databases (SQL) or CAD systems (such as EXPRESS), but it is sufficient for our current needs. In the future it may be augmented or replaced by something like EXPRESS.
The most important existing property set is the SGML property set. Although the SGML property set should theoretically be specified in the SGML standard, it is actually described in the HyTime standard. This is due to the fact that the SGML standard predates the grove concept! Nevertheless, the SGML property set is quite complete, robust and well thought out. The SGML property set can be used for XML documents, but it may sometimes be worth creating an XML property set in terms of the terminology and feature set of the XML specification.
Another important property set is the HyTime property set. This provides a data model for the HyTime links in a document. It should be possible to use the HyTime property set to describe XLink links. SGML groves are built from single documents, but HyTime groves are built from "hyperdocuments" constructed from many interlinked SGML documents. Nodes in the HyTime grove point back to the SGML document that contained the HyTime construct.
Other property sets include the Plain Text property set, which has only two classes (one for plaintext documents and one for each data character), and the Data Tokenizer property set, which can be used to break text up into tokens by separating on whitespace or other characters. These property sets are ISO standards. It is quite likely that other ISO standards will incorporate property sets in the future. Work is underway to create property sets for EXPRESS and STEP data and for the Computer Graphics Metafile (CGM).
ISO is not the only organization that can create property sets. Private individuals such as Sam Hunting and corporations such as TechnoTeacher have also created property sets for everything from schema languages to contract law.
The two most fundamental concepts in the property set paradigm are nodes and properties. Technically speaking, a node is just an "ordered set of properties, representing a single object." In a grove conforming to the SGML property set, examples of objects include elements, start-tags, generic identifiers or anything else in the property set. In a HyTime grove, examples of objects include links, anchors, hyperdocuments etc.
A property is a combination of a name and a value. Conceptually, this is similar to an attribute in XML or SGML. It is important not to confuse properties with XML attributes, however. The concepts are similar because the basic idea of name/value pairs is fundamental to information: think of a phone book or dictionary. Objects in object-oriented programming languages are also sets of name/value pairs. But the properties in an SGML grove are defined in the SGML property set, not in some particular DTD.
Every node conforms to some node class. Node classes are defined in the property set. All nodes of the same class have exactly the same properties, in exactly the same order. The class also restricts the possible types of each property.
Properties on nodes appear in the same order as their definition in the property set. Even if an implementation uses an unordered storage model (e.g. a dictionary or hashtable), it can present an ordered view of the properties by checking for the correct order in the property set. In the SGML property set, the first property of an element is its generic identifier (element type name). The second property is its unique identifier (ID), and the third property is attributes. By convention, these properties are ordered so that the most commonly used, widely understood properties are first and the less commonly used or understood properties are later. Thanks to this organization, and to places in the propset DTD that allow commentary, a property set definition document can serve as documentation for the property set.
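The ordering rule can be sketched as follows, assuming a toy property set held in a Python dictionary. The class name Node, the PROPERTY_SET table and the short property names are hypothetical; only the order of the first three element properties (generic identifier, ID, attributes) is taken from the text above. Values are stored in whatever order the caller supplies them, and the ordered view is recovered by consulting the property set.

```python
# Hypothetical miniature property set: each node class lists its property
# names in definition order. Only the first three element properties are shown.
PROPERTY_SET = {
    "element": ["gi", "id", "attributes"],
    "datachar": ["char"],
}


class Node:
    """A node whose storage is unordered but whose view is ordered."""

    def __init__(self, node_class, **values):
        declared = PROPERTY_SET[node_class]
        unknown = set(values) - set(declared)
        if unknown:
            raise ValueError(f"not declared for {node_class}: {sorted(unknown)}")
        self.node_class = node_class
        self._values = dict(values)  # kept in whatever order the caller used

    def properties(self):
        """Yield (name, value) pairs in property-set order, None if unset."""
        for name in PROPERTY_SET[self.node_class]:
            yield name, self._values.get(name)


node = Node("element", attributes={"class": "x"}, gi="para")
ordered_names = [name for name, _ in node.properties()]
```

Here the caller passed attributes before gi, yet the view still lists gi first, because the order comes from the property set and not from the storage.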
The easiest way to read a property set is with the output of TechnoTeacher's PropGrinder program. This program reads a property set and generates interlinked HTML pages for the various parts. The propset DTD is very compact and uses very short element type and attribute names. The PropGrinder program turns this terse document into something more readable and navigable.
As you read this document, you should follow along with PropGrinder's description of a simplified SGML property set at the HyTime Users' Group Web Site. Of course, you need to open that document in a separate window.
The so-called "SGML-ESIS" set is simplified in that it only supports the most commonly used parts of the property set (often termed the "ESIS"). It is simplified through a "grove plan." We will discuss grove plans more later.
A node class in a property set acts as a schema for nodes of that class. Every node must have a value for every property declared in its class. We say that the node "exhibits" that property. It is possible for it to exhibit a "null" value, just as in relational database or object oriented theory, but it must always have the property nevertheless. As an interesting diversion, this comparison to relational databases provides a hint of an implementation strategy for property sets. Each class can have a table with as many columns as there are properties.
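The table-per-class hint can be tried out with Python's standard sqlite3 module. This is only a sketch of the implementation strategy suggested above, not a description of how any grove product stores nodes. The table name, the function name demo_element_table and the choice to show just two properties (the rest are omitted for brevity) are assumptions.

```python
import sqlite3


def demo_element_table():
    """Store two element nodes in a table with one column per property."""
    conn = sqlite3.connect(":memory:")
    # One table per node class; only two of the class's properties are shown.
    conn.execute("CREATE TABLE element (gi TEXT NOT NULL, id TEXT)")
    # The second element has no ID: it still exhibits the property,
    # but with a null value.
    conn.executemany(
        "INSERT INTO element (gi, id) VALUES (?, ?)",
        [("chapter", "c1"), ("para", None)],
    )
    rows = conn.execute("SELECT gi, id FROM element ORDER BY rowid").fetchall()
    conn.close()
    return rows
```

Every row has every column, which mirrors the rule that a node always exhibits every property of its class even when the value is null.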
An example of a node class is the Element class. It will help you to understand what follows if you look at the PropGrinder output for this class while you read. To get there, click on the link called "Classes" and then click on the class called "Element". Once you are there, you will see a display dedicated to information about the element node class. At the top is information about the node class. If you have frames turned on, the bottom left lists the node class's properties. The bottom right displays information about a particular property.
Just as columns in relational tables have a particular type (string, number, date, etc.), property values have types. The list of these types is in section A.4.1.1 of the HyTime specification. Some of them are very simple: "char" is a character. The "char" property of the "datachar" node is an example of a character property. Character set issues are beyond the scope of this tutorial. "String" is an ordered list of zero or more characters and "strlist" is an ordered list of zero or more strings.
The element's "GI" (generic identifier) property is a string property. Click on the string "GI" in the bottom left to see information about the property in the bottom right. PropGrinder describes the full name of the property: "Generic identifier". It also says that the property is in the default grove plan, which means that it should be provided by any grove builder that has not been asked not to provide it. Because it is a string, it gets a "string normalization rule" which basically describes how strings are normalized by the parser.
Its "verify type" is "other." The HyTime specification has this to say about the "verify type": "The attribute verify type (vrfytype) is used by the DSSSL transformation language. It is fully described in the DSSSL standard." The verify type is otherwise beyond the scope of this tutorial. Finally, the property set tells us what clause of the SGML specification defines the concept of a GI (clause 7.8, paragraph 1) and gives us a short description of the node: "Generic identifier (element type name) of element."
Other types of properties include "integer", for integral numbers, "intlist", for ordered lists of integers and "boolean", for true/false values.
A slightly more complex type of value is an "enum" or enumeration. This is similar to an enumeration in a programming language or in an SGML/XML attribute value. The property set designer can specify a list of named, valid values for the property. For instance for SGML entity types, the enumerated values are "text", "cdata", "ndata", "sdata", "subdoc" and "pi". You can see this by looking at the "Entity Type" property of the "Entity" class. To do this, click on "Classes" (at the top), then "Entity" (in the list) and then "Entity Type" (in the properties list in the bottom left corner).
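For readers who think in code, the entity type enumeration maps naturally onto an enumeration in a programming language. The Python sketch below uses the six values listed above; the class name EntityType and the helper parse_entity_type are hypothetical and are not part of any grove API.

```python
from enum import Enum


class EntityType(Enum):
    """The enumerated values listed for the entity type property."""

    TEXT = "text"
    CDATA = "cdata"
    NDATA = "ndata"
    SDATA = "sdata"
    SUBDOC = "subdoc"
    PI = "pi"


def parse_entity_type(name):
    """Return the matching member, or raise ValueError for an unlisted value."""
    return EntityType(name)
```

Any value outside the list is rejected, which is the point of an enumerated property: the property set, not the document, decides which values are valid.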
There are also node value types called "component name" and "component name lists." A component name represents the name of some grove property, class name or enumerated value. Component names are not just strings, although some grove-based APIs may treat them as strings. You can think of component names as strings that are known to the grove processor "in advance" because they are from the property set. Ordinary strings are not known in advance. The grove makes the distinction because strings that are known in advance may be "compiled" (or, more formally, "interned") into integers and referred to more efficiently.
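The interning idea can be sketched in a few lines of Python. This is only an illustration of the principle: the ComponentNameTable class and its method names are hypothetical and are not part of any grove API.

```python
class ComponentNameTable:
    """Maps names known in advance to small integers (hypothetical helper)."""

    def __init__(self, names):
        # The names come from the property set, so they are all known up front.
        self._names = list(names)
        self._ids = {name: ident for ident, name in enumerate(self._names)}

    def intern(self, name):
        # Raises KeyError for ordinary strings that are not component names.
        return self._ids[name]

    def name_of(self, ident):
        # Recover the readable name from the integer.
        return self._names[ident]


table = ComponentNameTable(["gi", "content", "attributes"])
gi_id = table.intern("gi")
```

Once interned, two component names can be compared by comparing integers, which is the efficiency the distinction is meant to allow.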
Another type of value is a nodal value. This is conceptually a "pointer to" or "reference to" another node. You can also have lists of these references to nodes called "node lists." For instance, HyTime links would point to other nodes through nodal values. In fact, the entire grove is constructed through nodal (and node list) values. In the SGML property set, the "SGML Document Node" has a property called "governing doctype" that refers to the DTD that is in use. The type of that property is "document type." It also has a property that refers to the root element of the document, called the "document element" property. The class of this element is just "element". All element nodes (including the document element) point to their children (elements, characters, etc.) through a node list property called "content". This continues down to every data character in the content of an element.
You can follow this path by starting at the SGML document node. Do this by clicking on the "Classes" link at the top and "SgmlDocument" in the class list. You can see the GoverningDoctype and DocumentElement properties I described above. If you click on the DocumentElement property, you will see that it allows nodes of type Element. If you click on the word Element, the display will change to present information about the Element class. It should look familiar. We've been here before. From there, you can drill down into the valid content for an element by clicking on the Content property in the bottom left corner. From there you can see that the Content property is a node list which allows "Data Char" nodes, along with elements, external data, processing instructions and sdata.
Attributes are slightly different. The attributes property of Element nodes is of type named node list. A named node list is like a node list in that it is an ordered list of nodes. But it is more than an ordinary node list. Each node in the named node list is assigned a name based on some property of the node. For example, the attributes property of element nodes can contain only attribute assignment nodes. Each of these nodes must have a "name" property.
You can see that this property is the "name property" of the "attributes" named node list. First go to the Element class page as we did before. Then click on the word "Attributes" in the properties list for the class "element." In the bottom right hand corner, there should be a box labelled "Allowed Classes" with a small table (with perhaps only one item in it). On the left side is the name of a class that is allowed, and on the right side is the "name property" of that class. The word "name" should appear because "name" happens to be the "name property" for the "attribute assignment" nodes in the "attributes" named node list of the "element" node. Click on the word "name" to see the "attribute assignment" class description. As a shortcut, we can say that the attributes named node list is "indexed" by the attributes' names. This is another hint
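A named node list can be pictured as a list that also offers lookup by name. The following Python sketch is an assumption-laden illustration: the NamedNodeList class is hypothetical, nodes are modelled as plain dictionaries, and names are assumed to be unique within one list.

```python
class NamedNodeList:
    """Ordered list of nodes, also indexed by a name property (hypothetical)."""

    def __init__(self, name_property, nodes=()):
        self.name_property = name_property
        self._nodes = []
        self._by_name = {}
        for node in nodes:
            self.append(node)

    def append(self, node):
        key = node[self.name_property]
        # Assumption: a name may occur only once in a named node list.
        if key in self._by_name:
            raise ValueError("duplicate name: %r" % key)
        self._nodes.append(node)
        self._by_name[key] = node

    def __getitem__(self, key):
        # Integers index by position; anything else is treated as a name.
        if isinstance(key, int):
            return self._nodes[key]
        return self._by_name[key]

    def __len__(self):
        return len(self._nodes)


attributes = NamedNodeList("name", [
    {"name": "ID", "value": "FOO"},
    {"name": "REF", "value": "BAR"},
])
```

With this in place, attributes[0] and attributes["ID"] return the same node, which is what "indexed by the attributes' names" means in practice.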
As you can see, the grove can be thought of as a sort of "parse tree" with nodes containing nodes. But not every property expresses a logical "contains" relationship. Some properties express other sorts of relationships. For instance, an entity reference node must point to its entity definition node and similarly an element node must point to its element type node. There is no logical container relationship there. We refer to properties that express a container relationship as "sub-nodal" properties. We refer to properties that express any other relationship as an "internal reference" ("irefnode") or "unrestricted reference" ("urefnode"). A node referenced through an irefnode property must be in the same grove as the referencing (property containing) node. Nodes referenced through a urefnode property may be in the same grove or another one.
It is quite common to want to do something with each node in a grove. For instance you might want a printed representation of a grove, or you might want to store it in a database. It is easy to visit each node because every node except the grove's root node occurs once and only once as a subnode of some other node. The other node is referred to as the subnode's origin. Every node has an "origin" property that allows us to find the node's origin. The grove root's "origin" property is null.
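Because each node appears exactly once as a subnode, a simple recursive walk over the sub-nodal properties visits every node exactly once. Here is a minimal Python sketch; the Node class, its constructor and the walk function are hypothetical stand-ins, not a real grove API.

```python
class Node:
    """Toy grove node that holds only sub-nodal properties (hypothetical)."""

    def __init__(self, class_name, **subnode_properties):
        self.class_name = class_name
        self.origin = None  # stays None (null) for the grove root
        self.subnodes = {}  # sub-nodal property name -> list of nodes
        for prop, nodes in subnode_properties.items():
            self.subnodes[prop] = list(nodes)
            for node in self.subnodes[prop]:
                node.origin = self  # each subnode has exactly one origin


def walk(node):
    """Yield the node and then everything below it, each node once."""
    yield node
    for nodes in node.subnodes.values():
        for subnode in nodes:
            yield from walk(subnode)


title = Node("element")
para = Node("element")
doc_element = Node("element", content=[title, para])
root = Node("sgmldoc", docelem=[doc_element])
visited = list(walk(root))
```

References that are not sub-nodal (irefnode and urefnode properties) are deliberately left out of the walk; following them would visit nodes more than once.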
"Origin" is the first property we have discussed that is common to every single grove, no matter what its property set. These common properties are called "intrinsic properties." We will discuss these later.
In many media types, there is a distinction between content data and metadata. Arguably, in SGML and XML, the DTD is metadata, attributes are metadata and character data is "real" data. The property set paradigm allows the property set definer to specify the difference between metadata and data. Every node is allowed to have one property that it distinguishes as being its "children" property. The name of the property does not have to be "children". For SGML/XML elements, it is just "content." The children property (whatever its name) is considered to be the data. All other properties are not. Presumably they are metadata. There is only one children property allowed per node.
Imagine we are writing software that works with any sort of grove. We might want to know what a node's children property is. One way to do that would be to read the property set document (it is an SGML document, after all). That is rather a hassle, though. It is possible to ask a node what its children property's name is, rather than reading the property set document. Every node exhibits a component name-valued property called "Children Property Name". If the node class has a children property, the value of the children property name property must be the name of some other property that the class exhibits. For instance, element nodes would have a "children property name" property with a value of "content" because the element's content property is its children property.
Intuitively, this corresponds to the tree that you would draw of an SGML document: the logical parse tree. It includes elements and their contents, but not things such as attributes, the DTD, start- and end-tags, etc. This tree also corresponds to the logical tree that almost every existing stylesheet language works upon. By default, stylesheet languages typically look at each element and output its data content, ignoring its attributes. They make the same distinction between metadata and content data.
Of course, any particular DTD designer might encode something she logically considers metadata as an element instead of an attribute, but she will find that the grove and most SGML/XML software (whether grove-based or not) will not really support her in that decision. SGML/XML practitioners have various techniques for working around this: for instance, if your application does not need the distinction between data and metadata, it can treat ordinary subnode properties and children properties the same.
A similar, related property is the data property. Some nodes carry content data but have no subnodes. The node might carry its data in a character or string property instead of in subnodes. Obviously a data character node cannot have some other node as its subnode and so on infinitely! Instead of having a children property, these nodes have a data property. The data property is specified with a data property name intrinsic property. Every node class can have either a data property or a children property, but not both. Either one can be referred to as the content property. For instance, the notation class does not have either, so it has no content property. On the other hand, the Element class has "Content" as its content property. You can see this by looking at the class description for the Element class output by PropGrinder.
Using the content property, it is possible to extract the data of a node automatically. The data of a node with a data property is the value of that property. Of course this data can only be a character or string. The data of a node with a children property is the concatenation of the data produced by each of the node's children perhaps separated by a data separator. A data separator is specified through a Data Separator property which is found based on a Data Separator Property Name property.
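The extraction rule can be sketched as a short recursive function. This Python example is a simplified illustration under stated assumptions: the Node class and the lower-case attribute names are hypothetical, and the separator is stored directly on the node instead of being found through a Data Separator Property Name.

```python
class Node:
    """Toy node with a property dictionary and content bookkeeping."""

    def __init__(self, properties, children_property_name=None,
                 data_property_name=None, data_separator=""):
        self.properties = properties
        self.children_property_name = children_property_name
        self.data_property_name = data_property_name
        self.data_separator = data_separator  # simplification, see above


def node_data(node):
    """Return the data of a node, following the content property."""
    if node.data_property_name is not None:
        # Data property: the value itself is the data.
        return node.properties[node.data_property_name]
    if node.children_property_name is not None:
        # Children property: concatenate the children's data.
        children = node.properties[node.children_property_name]
        return node.data_separator.join(node_data(c) for c in children)
    return ""  # no content property at all, as for a notation


def char(c):
    return Node({"char": c}, data_property_name="char")


para = Node({"gi": "P", "content": [char("H"), char("i")]},
            children_property_name="content")
tokens = Node({"tokens": [Node({"token": "a"}, data_property_name="token"),
                          Node({"token": "b"}, data_property_name="token")]},
              children_property_name="tokens", data_separator=" ")
notation = Node({"name": "gif"})
```

The function never needs to know the class of the node it is given; it relies only on the two property names, which is the point of making them available on every node.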
There are certain properties that every node exhibits, no matter what its class or its property set. These are called the intrinsic properties. We do not repetitively define these in a property set definition document, so you will not find them in the definitions for every node.
We have already discussed the Children Property Name, Data Property Name and Data Separator Property Name properties. These say which property (if any) should be considered to be the content property of a node and if the node has children, what separator should be placed between the children's data when they are concatenated. Since a node can have at most one of these "content" properties, either the children or data property is required to be null, and in some cases they may both be null. These properties are provided to make the property set "reflexive". The goal is to be able to learn things about a node's definition from its properties. The data separator property works with the data property name to allow us to properly construct the data of the node.
Another important intrinsic property is All Property Names. This is also a property that reflects the node's definition. By asking a node for all of its property names, we can write software that can work with groves without knowing about particular property sets. For example, an application could store nodes in a database or transmit them over the Web without reading their property set definition document or having hard-coded information about them. A similar property is Subnode Property Names which is a list of the names of all of the subnode properties of the node.
Also related to reflexiveness is the Class Name property. Given a node, it is possible to ask it for its class name and base special processing on that name. For example, a class name could be used to look up the documentation for a class in the property set definition document.
Other intrinsic properties help with navigation around the grove. Every node has a Grove Root property that describes how to get to the root node of the current grove. Of course you could find this property in code by looking at the origin of the node, and the origin of its origin, and so on. Eventually when you find a node with a null origin, you have found the grove root. The grove root property makes this easier, however.
Another navigational property is the Origin to Subnode Relationship Property Name property. Every node in the grove except the root has an origin. The node must be referred to in some sub-nodal property of its origin. The name of that property is the value of the node's Origin to Subnode Relationship Property Name property. This property allows you to navigate up to the origin node and then back down to the subnode. If the Origin to Subnode Relationship is a children property, then we say that the subnode is a child and the origin node is a parent. If a node is a child, we can get its parent node with the intrinsic Parent property.
The final intrinsic property is the Tree Root property. This is equivalent to walking up from node to node as long as each node has a parent. The node that has no parent is the tree root. It might not be the grove root, however. If the node has an origin, then it is not the grove root, but if the origin is not a parent then the node is a tree root.
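The difference between the grove root and the tree root is easier to see in code. The Python sketch below computes both by walking upward; the Node class, the function names and the toy grove are hypothetical, and a real grove processor would simply expose these values as intrinsic properties.

```python
class Node:
    """Toy node that records its origin and how the origin refers to it."""

    def __init__(self, name, children_property_name=None):
        self.name = name
        self.children_property_name = children_property_name
        self.origin = None
        self.origin_to_subnode_property_name = None
        self.properties = {}

    def add_subnodes(self, prop, nodes):
        self.properties[prop] = list(nodes)
        for node in nodes:
            node.origin = self
            node.origin_to_subnode_property_name = prop


def parent(node):
    """Return the origin only if the node sits in its children property."""
    origin = node.origin
    if origin is None:
        return None
    if node.origin_to_subnode_property_name == origin.children_property_name:
        return origin
    return None


def grove_root(node):
    while node.origin is not None:  # follow origins to the top
        node = node.origin
    return node


def tree_root(node):
    while parent(node) is not None:  # follow parents only
        node = parent(node)
    return node


elem = Node("elem", children_property_name="content")
child = Node("child")
att = Node("att", children_property_name="value")
token = Node("token")
elem.add_subnodes("content", [child])
elem.add_subnodes("attributes", [att])  # a subnode, but not a child
att.add_subnodes("value", [token])
```

In this toy grove the token's tree root is the attribute node, because the attribute has an origin but no parent, while its grove root is the element at the top.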
The HyTime specification defines a grove plan as "a specification of what modules, classes, and properties to include in a grove. Grove plans are used both to construct groves and to view existing groves." Grove plans are defined through "grove plan" elements. These elements are defined in section 18.104.22.168 of the HyTime specification. They can include or exclude classes and properties.
Sets of property set components may be grouped together into modules so that a grove plan element can include or exclude them all at once.
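The effect of a grove plan can be imitated with a simple filter. In this Python sketch the plan is reduced to a set of property names; the module names, the MODULES table and both functions are hypothetical, and real grove plans are expressed as elements as described in the HyTime specification.

```python
# Hypothetical grouping of property names into modules.
MODULES = {
    "core": {"gi", "content", "attributes"},
    "markup": {"comments"},
}


def plan_properties(include_modules, exclude_properties=()):
    """Return the set of property names that the plan keeps."""
    names = set()
    for module in include_modules:
        names |= MODULES[module]  # include a whole module at once
    return names - set(exclude_properties)


def view(properties, plan):
    """Show only the properties of a node that the plan includes."""
    return {name: value for name, value in properties.items() if name in plan}


element = {"gi": "P", "content": [], "attributes": [], "comments": []}
plan = plan_properties(["core", "markup"], exclude_properties=["comments"])
visible = view(element, plan)
```

The same filter serves both uses named in the definition: a builder can skip excluded properties while constructing a grove, and a viewer can hide them in an existing one.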
The SGML property set defines the results of an SGML parse. It serves as a data model for hypertext linking and as a basis for SGML creation and management APIs. This section will describe the major node types in the SGML property set.
The root node of an SGML grove is always an SGML document node. SGML Document nodes never occur as subnodes of other nodes. The SGML document node has among its subnodes the document element, the DTD and various sorts of document-global information. Some important properties of the SGML document node are:
Most SGML processing is controlled by elements and their generic identifiers (element type names). Element nodes can occur as the document element or in the content of some other element.
Attributes are very common in SGML/XML documents. They allow authors to attach extra information (strings) to elements. The nodes that represent attribute assignments are different from those that represent attribute definitions (in the DTD). Here are some of the important properties of attribute assignment nodes:
Attribute value token nodes represent the value of tokenized attributes. If you want to work with each token of a tokenized attribute value individually, you should use attribute value tokens instead of asking for the data of the attribute assignment.
You can find an element with a certain ID by asking for the "Elements" property of the SGML document node. In Python, given any node, you could get the element with the ID foo like this:
node.GroveRoot.Elements["FOO"]
For instance, to resolve an IDREF attribute named REF:
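A minimal, self-contained sketch of such a lookup follows. Only the GroveRoot and Elements names come from the expression above; the Node class is a hypothetical stand-in for a real grove node, and the Attributes, Name, Value and GI spellings, along with the upper-casing of the ID, are assumptions made for the example.

```python
class Node:
    """Hypothetical stand-in for a grove node; properties are attributes."""

    def __init__(self, **properties):
        self.__dict__.update(properties)


def resolve_idref(element, attribute_name):
    """Return the element that an IDREF attribute points to."""
    idref = element.Attributes[attribute_name].Value
    # Assumption: IDs are stored in upper case, as in the "FOO" example.
    return element.GroveRoot.Elements[idref.upper()]


# Build a tiny grove by hand: one document node and two elements.
doc = Node(Elements={})
target = Node(GI="SECTION", GroveRoot=doc,
              Attributes={"ID": Node(Name="ID", Value="FOO")})
source = Node(GI="XREF", GroveRoot=doc,
              Attributes={"REF": Node(Name="REF", Value="foo")})
doc.Elements["FOO"] = target

resolved = resolve_idref(source, "REF")
```

The lookup goes up to the grove root and back down through the named node list, so it works from any node in the document.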
You can find an entity by name using the same procedure as for elements, but use the "Entities" property of the SGML Document node instead of the "Elements" property:
The definition of the grove paradigm is the Property Set Definition Requirements annex of the HyTime specification. It is quite readable, especially if you have already read this tutorial. You can also contact me with questions, but I can't guarantee rapid response unless you are a consulting customer.
If you want to help pursue a web version of the grove paradigm, the right place to do so is within the XML-DEV mailing list. Many bright people there are already working with groves even though they are not yet a W3C standard officially connected to XML.