Python and SGML


Subject:   Python and SGML 
Author:    W. Eliot Kimber
Email:     eliot@isogen.com
Date:      1998/11/02
Groups:    comp.text.sgml

Since I mentioned Python and SGML in an earlier post, I thought I'd show how easy it is to do really powerful stuff for free using Python and the Jade package on Windows (sorry, equivalent not yet available for Unix, but stay tuned....).

Python (<http://www.python.org>) is a brilliant programming language that is, to my taste, ideally suited to SGML and XML processing. Its easy-to-use object orientation, its built-in list semantics, and the fact that it's interpreted make it really easy to create the same sorts of programs you might use DSSSL or Balise for, but with a general-purpose programming language that is easy to learn and much more familiar than DSSSL or Omnimark. Python is a free, publicly-developed language, not a commercial product.

Python comes with support for COM/ActiveX under Windows. With the Jade package (<http://www.jclark.com/jade>) you get groveoa.dll, which is a COM server that provides SGML grove access, at least to the instance properties. This means you can quickly and easily do grove-based processing from Python. This is very cool.

To use the following sample program all you need to do is download the base Python package and the win32com package, both available for free from the Python.org site. Follow their installation instructions. Download the latest version of Jade and put all the DLLs somewhere in your DOS PATH (e.g., /windows/system). [Be careful if you already use SP-based tools--there may be different versions of these DLLs hanging about on your system--it's easy to get things confused by having different versions of the DLLs in different places.]
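
One way to check for the mixed-version problem is to list every copy of a DLL that the PATH can reach. The following is a minimal sketch, not part of the Jade or Python distributions: find_dll_copies is a hypothetical helper name, and it only reports where files with a given name sit; it does not compare versions. It uses the parenthesized print form so that it runs on a current Python interpreter.

```python
import os


def find_dll_copies(dll_name, search_path=None):
    """Return the full path of every file named dll_name found on the search path.

    search_path is a PATH-style string; if omitted, the PATH environment
    variable is used. Names are compared case-insensitively, as on Windows.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    wanted = dll_name.lower()
    hits = []
    seen = set()
    for directory in search_path.split(os.pathsep):
        directory = directory.strip('"')
        if not directory or not os.path.isdir(directory):
            continue
        # Skip directories that appear on the path more than once.
        key = os.path.normcase(os.path.abspath(directory))
        if key in seen:
            continue
        seen.add(key)
        try:
            entries = os.listdir(directory)
        except OSError:
            continue  # unreadable directory; nothing to report
        for entry in entries:
            if entry.lower() == wanted:
                hits.append(os.path.join(directory, entry))
    return hits


if __name__ == "__main__":
    # groveoa.dll is the only DLL name taken from the text above.
    for location in find_dll_copies("groveoa.dll"):
        print(location)
```

More than one line of output for the same DLL name suggests that two installations are competing, and the copy in the earliest PATH directory is the likely winner.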

To create a program that parses a document into memory (from which you can then do anything you want), you just do this:

#---------------------------------------------------------------
# Python program to demonstrate grove-based processing using
# groveoa.dll. 
# Author: W. Eliot Kimber, eliot@isogen.com
#---------------------------------------------------------------
import win32com.client
import sys
# See Python library ref for other libraries you might want, like
# string, re, glob, etc.

def construct_SGML_grove (systemid):
  "Parses an SGML or XML document into memory. Returns grove root node."
  print "Processing document " + systemid
  # Create a "grove builder" object:
  gb = win32com.client.Dispatch("SP.GroveBuilder")
  # Now use the grove builder to construct an SGML document grove:
  grove = gb.parse(systemid)
  return(grove)

#-----------------------------------
# Main processing:
#----------------------------------
if len(sys.argv) >  1:
  sgmldoc = construct_SGML_grove(sys.argv[1])
  if sgmldoc.DocumentElement:
    print "Document element type='" + sgmldoc.DocumentElement.Gi + "'"
    # Note that the grove nodes behave just like any other Python
    # object. Use the dir() function and Python object browser to
    # determine the available methods and properties. All the 
    # properties defined in the SGML property set are accessed using
    # their application names with spaces removed and init cap on
    # each word.  You can use normal Python list processing to 
    # process node lists in the grove, e.g.:
    # for node in sgmldoc.DocumentElement.Content:
    #   print node.Class
    # 
    # Which will print out the node class of each content node
    # (for groveoa.dll, these are enumerations so you get an
    # integer, not a name. I've got the enumerations mapped to
    # names in some code; if you want them, send me email.)
else:
  print "You must specify the file name of a document to process."
#-- End of program
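
The naming rule described in the comments above (application name, spaces removed, initial capital on each word) can be sketched as a small helper. The function names here are hypothetical and are not part of groveoa.dll or win32com. The inputs assume that the application names behind the DocumentElement and Gi properties used in the program are "document element" and "gi".

```python
def com_property_name(application_name):
    """Apply the rule: drop the spaces, capitalize the first letter of each word."""
    words = application_name.split()
    return "".join([word[:1].upper() + word[1:] for word in words])


def get_property(node, application_name):
    """Fetch a property from a node by its application name."""
    return getattr(node, com_property_name(application_name))


print(com_property_name("document element"))  # prints DocumentElement
print(com_property_name("gi"))                # prints Gi
```

With a parsed grove, get_property(sgmldoc, "document element") would then be another way of writing sgmldoc.DocumentElement.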

You can also use Python's interactive mode to parse a document into memory and then explore the resulting grove. When using the Pythonwin window, you can use the Python object browser to explore the grove visually.
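
As one example of that kind of exploration, the following sketch walks a grove and builds an indented outline of element types. It uses only the Gi and Content properties shown in the program above. Two things are assumptions: that nodes which are not elements (data characters, for instance) have no Gi property and raise AttributeError when asked for it, and that Content can be looped over at every level. FakeElement and FakeData are hypothetical stand-ins so that the sketch runs without groveoa.dll; it has not been tested against a real grove.

```python
def outline(node, depth=0, lines=None):
    """Collect one indented line per element at or below node."""
    if lines is None:
        lines = []
    gi = getattr(node, "Gi", None)
    if gi is None:
        return lines  # assumed to be a non-element node; skip it
    lines.append("  " * depth + gi)
    for child in getattr(node, "Content", None) or []:
        outline(child, depth + 1, lines)
    return lines


class FakeElement:
    """Hypothetical stand-in for an element node."""
    def __init__(self, gi, content=()):
        self.Gi = gi
        self.Content = list(content)


class FakeData:
    """Hypothetical stand-in for a non-element node; it has no Gi."""


doc = FakeElement("book", [
    FakeElement("title", [FakeData()]),
    FakeElement("chapter", [FakeElement("para", [FakeData()])]),
])
print("\n".join(outline(doc)))
# prints:
# book
#   title
#   chapter
#     para
```

With a real grove the call would be outline(sgmldoc.DocumentElement), using the sgmldoc returned by construct_SGML_grove above.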

If you've been using Perl and NSGMLS to do SGML or XML processing, or if you've been using DSSSL and Jade to do transforms (not formatting), I'd urge you to try out the above sample. I think you'll find it encouragingly easy to use, with lots of potential.

While I like DSSSL and Jade tremendously, they do have two severe limitations that the above approach avoids:

1. The DSSSL syntax, while elegant and well suited to the task, is a significant barrier to use and maintainability.
2. Jade does not provide any form of API or facilities for interacting with the rest of the system.

By using Python with groveoa.dll (or any other grove-based system--see below), you get all the processing power you get with Jade plus you get a much more familiar and easy-to-learn programming language. You also get the ability to integrate with a wide variety of systems, including COM-based stuff (e.g., VB, Delphi, PowerBuilder).

Note that this combination does not provide all the functions that commercial tools like Omnimark and Balise provide, although it does do a lot of things you might otherwise need those sorts of tools for.

Speaking of additional tools, there will soon be more grove-based, Python-integrated tools available that will provide more functionality than groveoa.dll. TechnoTeacher (<http://www.techno.com>) will soon be releasing the first commercial version of their GroveMinder product, which has a Python binding (in addition to other language bindings). This will give you much more functionality in a commercial tool--multiple groves, grove persistence, groves from different data types, HyTime support, etc. Paul Prescod, also of ISOGEN, is putting together a "PyGrove" package that interfaces SP's grove support directly into Python, making it available on all platforms, not just Windows. This will be a free tool from ISOGEN. I'm in the process of re-implementing as much of my PHyLIS HyTime engine (<http://www.phylis.com>) in Python as I can (probably everything but the user interface). These tools will provide a variety of grove- and Python-based processing options of varying price and functionality in the very near future. [One of our goals with all this is to demonstrate the plug-and-play nature of groves by providing a common data abstraction that can be implemented in a number of ways.]

Another note: because the Balise product is itself COM/ActiveX based, you could use the same technique shown above to use it from Python as well. This might be very interesting too, although I haven't tried it myself.

Dr. Macro says check it out.


Note: See "XML and Python."


<Address HyTime=bibloc homepage="http://www.drmacro.com">
W. Eliot Kimber, eliot@isogen.com
Senior SGML Consulting Engineer, Highland Consulting
2200 North Lamar Street, Suite 230, Dallas, Texas 75202
+1-214-953-0004 +1-214-953-3152 (fax)
http://www.isogen.com (work)</Address>