Python and XML

By Mike Hostetler

[Cached version from http://mike.hostetlerhome.com/present_files/pyxml.html]


Presented on Sept 5, 2006 to the Omaha Dynamic Languages Group

This work is licensed under a Creative Commons Attribution-Share Alike 2.5 License.

What we will cover

minidom

From The Python Library Reference

"xml.dom.minidom is a light-weight implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also significantly smaller."

Good:

Bad:

ElementTree

From http://effbot.org/zone/element-index.htm

"The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. The element type can be described as a cross between a Python list and a Python dictionary. The ElementTree wrapper adds code to load XML files as trees of Element objects, and save them back again."

Good

Bad
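
The "cross between a Python list and a Python dictionary" description quoted above can be seen in a few lines. This is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree module (not the standalone package the slides were written against); the tag and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small tree by hand (names are illustrative only).
root = ET.Element("tag", {"name": "value"})
ET.SubElement(root, "first")
ET.SubElement(root, "second")

# Dictionary-like: attributes are read and written by key.
root.set("other", "42")
attr_name = root.get("name")
attr_keys = sorted(root.keys())

# List-like: children are counted, indexed and iterated in document order.
child_count = len(root)
first_child = root[0].tag
child_tags = [child.tag for child in root]

print(attr_name, attr_keys)
print(child_count, first_child, child_tags)
```

Attributes behave like dictionary entries on the Element, while child Elements behave like list items.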

cElementTree

From http://effbot.org/zone/celementtree.htm

"The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory. On modern hardware, that means that documents in the 50-100 megabyte range can be manipulated in memory, and that documents in the 0-1 megabyte range load in zero time (0.0 seconds). This allows you to drastically simplify many kinds of XML applications."

Good

Bad

Other Libraries

Other libraries you may see mentioned on the Web:

PyXML

4Suite

libxml2

Examples

In the following examples, the minidom code is on the top and the cElementTree code is on the bottom.

Example

Importing the Packages

(at least how I do it)

from xml.dom import minidom

import cElementTree as ET
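
Since Python 2.5, ElementTree has also shipped in the standard library under xml.etree, so the standalone package is not always needed. The sketch below shows one possible import fallback; it is an assumption about how you might adapt the imports today, not how the slides were run.

```python
from xml.dom import minidom

try:
    # Python 2.5 through 3.8 expose the C implementation under this name.
    import xml.etree.cElementTree as ET
except ImportError:
    # The alias was removed in Python 3.9; since Python 3.3 the plain module
    # picks up the C accelerator on its own when it is available.
    import xml.etree.ElementTree as ET

# Quick check that both libraries are usable.
root = ET.fromstring("<tag/>")
doc = minidom.parseString("<tag/>")
print(root.tag, doc.documentElement.tagName)
```

Either branch binds the same name, ET, so the rest of the code does not need to know which module was imported.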

Creating XML From Scratch

    roottag="<tag/>"
    newdoc=minidom.parseString(roottag)

    etElement=ET.Element("tag")

Creating A New Element

    newtag = newdoc.createElement("newtag")
    newdoc.documentElement.appendChild(newtag)

    newElement=ET.SubElement(etElement,"newtag")

Adding Attributes

    newtag.setAttribute("name","value")

    newElement.set('name','value')

Adding Text

    newtag.appendChild(newdoc.createTextNode("text value"))

    newElement.text="text value"

Importing From Another XML Tree

    newdoc2=minidom.parseString(roottag)
    newtag=newdoc2.childNodes[0]
    newtag2=newdoc.importNode(newtag,deep=1)
    newtag2 = newdoc.documentElement.appendChild(newtag2)

    newElement=ET.Element("tag")
    etElement.append(newElement)

Removing Elements

    newtag2.parentNode.removeChild(newtag2)

    etElement.remove(newElement)

Iteration

    for tag in newdoc.getElementsByTagName("newtag"):
        print tag.getAttribute("name")

    for tag in etElement.findall("newtag"):
        print tag.get("name")
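
When iterating with ElementTree it matters whether you call find or findall: find returns a single Element, and looping over a single Element walks its children rather than its siblings. A small sketch of the difference, assuming Python 3 and the standard-library module; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Illustrative document: two <newtag> children, the first with a child of its own.
root = ET.fromstring(
    '<tag><newtag name="a"><inner name="x"/></newtag><newtag name="b"/></tag>'
)

# find() returns only the first matching child (or None if nothing matches).
first = root.find("newtag")

# Iterating over that one Element visits ITS children, not the other <newtag>.
from_find = [child.get("name") for child in first]

# findall() returns a list of every matching child, which is what a loop wants.
from_findall = [tag.get("name") for tag in root.findall("newtag")]

print(from_find)
print(from_findall)
```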

Printing Out

    print newdoc.toxml()

    print ET.tostring(etElement)
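
The creation, attribute and text steps from the preceding slides can be gathered into one script. This is a sketch for Python 3 with the standard-library xml.etree.ElementTree (an assumption; the slides use Python 2 and the standalone cElementTree), and it leaves out the import and remove steps so that only the first newtag remains.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# --- minidom ---
newdoc = minidom.parseString("<tag/>")
newtag = newdoc.createElement("newtag")
newdoc.documentElement.appendChild(newtag)
newtag.setAttribute("name", "value")
newtag.appendChild(newdoc.createTextNode("text value"))
# Serialize just the root element, so no XML declaration is emitted.
dom_xml = newdoc.documentElement.toxml()

# --- ElementTree ---
etElement = ET.Element("tag")
newElement = ET.SubElement(etElement, "newtag")
newElement.set("name", "value")
newElement.text = "text value"
# encoding="unicode" asks for a str instead of bytes on Python 3.
et_xml = ET.tostring(etElement, encoding="unicode")

print(dom_xml)
print(et_xml)
```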

The Final Result

<?xml version="1.0" ?>
<tag><newtag name="value">text value</newtag></tag>

<tag><newtag name="value">text value</newtag></tag>

ElementTree-only stuff

XPath and ElementTree

ElementTree supports a basic subset of XPath for finding Elements in a tree.

The find method returns the first Element that matches the XPath query.

The findall method returns all the Elements that match the query.

The findtext method finds the first Element matching the XPath query and returns its text.

    print et.find("/wpt/desc").text

    for t in et.findall("/wpt/desc"):
        print t.text

    print et.findtext("/wpt/desc")
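
To try the three methods without a GPX file on disk, here is a self-contained sketch. It assumes Python 3 and the standard-library module, uses made-up sample data without namespaces, and uses paths relative to the root Element rather than the leading-slash form shown above.

```python
import xml.etree.ElementTree as ET

# Made-up sample data, shaped like a GPX file with the namespace left out.
root = ET.fromstring(
    "<gpx>"
    "<wpt><name>A</name><desc>first cache</desc></wpt>"
    "<wpt><name>B</name><desc>second cache</desc></wpt>"
    "</gpx>"
)

# find(): first match only.
first_desc = root.find("wpt/desc").text

# findall(): every match, as a list.
all_descs = [d.text for d in root.findall("wpt/desc")]

# findtext(): text of the first match, or None when nothing matches.
same_as_first = root.findtext("wpt/desc")
missing = root.findtext("wpt/nosuchtag")

# ".//" searches at any depth below the starting Element.
names_anywhere = [n.text for n in root.findall(".//name")]

print(first_desc, all_descs, same_as_first, missing, names_anywhere)
```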

Namespaces

Namespaces are special in ElementTree. If the document defines namespaces, you must always use them in your queries, even for the default namespace, e.g. in a document that starts with:

<gpx
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.topografix.com/GPX/1/0" 
xsi:schemaLocation="http://www.topografix.com/GPX/1/0">

Here is a sample of how you have to access it:


>>> et=ET.parse("everything.gpx")
>>> et.find("wpt")
>>> print et.find("wpt")
None
>>> print et.find("{http://www.topografix.com/GPX/1/0}wpt")
<Element '{http://www.topografix.com/GPX/1/0}wpt' at 0xe3f8>

I like to use string templates for this. See next example.
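
Besides string templates, there are other ways to build the qualified name. The sketch below assumes Python 3 and the standard-library xml.etree.ElementTree; the qname helper and the "g" prefix are hypothetical names chosen for illustration, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/0"

# Made-up one-waypoint document that uses the GPX default namespace.
gpx = ET.fromstring(
    '<gpx xmlns="%s"><wpt lat="1.0" lon="2.0"><name>A</name></wpt></gpx>' % GPX_NS
)


def qname(tag, ns=GPX_NS):
    """Hypothetical helper: build the '{namespace}tag' form ElementTree expects."""
    return "{%s}%s" % (ns, tag)


# The bare tag name does not match anything in a namespaced document.
bare = gpx.find("wpt")

# The fully qualified name does.
qualified = gpx.find(qname("wpt"))

# find() also accepts a prefix-to-URI mapping; the prefix is local to this call.
prefixed = gpx.find("g:wpt", namespaces={"g": GPX_NS})

print(bare, qualified.tag, prefixed is qualified)
```

The prefix mapping keeps long URIs out of the query strings, while the helper keeps every lookup explicit about which namespace it uses.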

Example -- GPX to CSV

GPX is the XML format that comes out of a GPS. Here is a script that takes a GPX file and prints lat,lon,wayptname for each waypoint.


import sys,os
import cElementTree as ET
import string

if __name__ == '__main__':

    mainNS=string.Template("{http://www.topografix.com/GPX/1/0}$tag")

    wptTag=mainNS.substitute(tag="wpt")
    nameTag=mainNS.substitute(tag="name")

    et=ET.parse(open("everything.gpx"))

    for wpt in et.findall("//"+wptTag):
        wptinfo=[]
        wptinfo.append(wpt.get("lat"))
        wptinfo.append(wpt.get("lon"))
        wptinfo.append(wpt.findtext(nameTag))

        print ",".join(wptinfo)
  

Example -- GPX to CSV (cont)

Result:

39.655717000,-104.902083000,GCRQRZ
39.568783000,-104.913300000,GCRRCK
39.556767000,-104.874400000,GCRRHG
39.660650000,-104.762467000,GCRRQ5
39.664640000,-104.764720000,GCRWHG
39.572367000,-104.912567000,GCRWV5
39.705883000,-104.778600000,GCRZ5V
39.709617000,-104.786800000,GCRZ5X
39.566450000,-104.889233000,GCT2BP
....
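
For comparison, here is a sketch of the same conversion for Python 3, using the standard-library xml.etree.ElementTree and csv modules instead of a manual join. The inline sample stands in for everything.gpx; its third waypoint deliberately has no name (an invented case) to show that missing values become empty fields rather than an error. The gpx_to_rows function name is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

GPX_NS = "{http://www.topografix.com/GPX/1/0}"

# Inline stand-in for a GPX file; the nameless waypoint is an invented case.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/0">
  <wpt lat="39.655717000" lon="-104.902083000"><name>GCRQRZ</name></wpt>
  <wpt lat="39.568783000" lon="-104.913300000"><name>GCRRCK</name></wpt>
  <wpt lat="39.556767000" lon="-104.874400000"/>
</gpx>"""


def gpx_to_rows(root):
    """Yield [lat, lon, name] for every waypoint at any depth."""
    for wpt in root.iter(GPX_NS + "wpt"):
        yield [
            wpt.get("lat", ""),
            wpt.get("lon", ""),
            wpt.findtext(GPX_NS + "name", default=""),
        ]


root = ET.fromstring(SAMPLE)
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(gpx_to_rows(root))
csv_text = out.getvalue()
print(csv_text, end="")
```

Using csv.writer also takes care of quoting if a waypoint name ever contains a comma.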

Example -- restconnect

restconnect is a module I wrote to abstract a REST webservice API. Instead of creating the XML yourself, the RestConnect class creates the XML from properties, and assigns the result to another property. Here is an example:

### create the Geocode class
class Geocode(RestConnect):

    def __init__(self):

        RestConnect.__init__(self,
                     "http://api.local.yahoo.com/MapsService/V1/geocode?",
                     "urn:yahoo:maps")

        self.appid='xxxxx'

Example -- restconnect (cont)


## Use Geocode
if __name__=='__main__':

    g= Geocode()
    if len(sys.argv)<2:

        g.city="Omaha"
        g.State="NE"
        g.Street="14620 Frances Cir"
        g.zip="68144"


    else:
        g.location=''
        for x in sys.argv[1:]:
            g.location+="%s " %x

    g.fetch()
    print g.Latitude,g.Longitude
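
The RestConnect source is not shown in these slides, so the following is only a guess at the general idea of turning properties into a request. SketchRestConnect and build_url are hypothetical names, not the author's code, and the sketch only builds a URL from public attributes; it never contacts a server. It assumes Python 3.

```python
from urllib.parse import urlencode


class SketchRestConnect:
    """Hypothetical stand-in for illustration; NOT the author's RestConnect."""

    def __init__(self, base_url, namespace=None):
        # Leading underscores mark internal state that must not be sent.
        self._base_url = base_url
        self._namesp = namespace

    def build_url(self):
        # Every public attribute set on the instance becomes a query parameter.
        params = {
            key: value
            for key, value in sorted(vars(self).items())
            if not key.startswith("_")
        }
        return self._base_url + urlencode(params)


g = SketchRestConnect(
    "http://api.local.yahoo.com/MapsService/V1/geocode?", "urn:yahoo:maps"
)
g.appid = "xxxxx"
g.city = "Omaha"
g.state = "NE"
url = g.build_url()
print(url)
```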

Example -- restconnect (cont)

The minidom version:

    def _parse(self,xmlstr):

        dom = minidom.parseString(xmlstr)
        result = dom.getElementsByTagName("Result")[0]

        for child in result.childNodes:
            if child.firstChild:
                self.__dict__[child.tagName] = child.firstChild.data

        del dom

Example -- restconnect (cont)

The cElementTree version:

    def _parse(self,xmlstr):

        # xmlstr holds the XML text itself, so parse it with fromstring()
        et = ET.fromstring(xmlstr)
        if self._namesp:
            namesp=string.Template("{%s}$tag" %self._namesp)

        else:
            namesp=string.Template("$tag")

        resultTag=namesp.substitute(tag="Result")
        result=et.find(resultTag)

        for child in list(result):
            # strip the "{namespace}" prefix, if any, from the tag name
            if child.tag.find("}")>-1:
                tagname=child.tag[child.tag.find("}")+1:]
            else:
                tagname=child.tag

            self.__dict__[tagname] = child.text


        del et
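
The same parsing idea can be tried outside the class. This sketch assumes Python 3 and the standard-library xml.etree.ElementTree; the sample response, its values, and the function names local_name and parse_result are all invented for illustration and do not come from a real service.

```python
import xml.etree.ElementTree as ET

# Invented response with placeholder values, shaped like a namespaced result.
SAMPLE = (
    '<ResultSet xmlns="urn:yahoo:maps">'
    "<Result>"
    "<Latitude>1.5</Latitude><Longitude>-2.5</Longitude><City>OMAHA</City>"
    "</Result>"
    "</ResultSet>"
)


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_result(xmlstr, namesp=None):
    """Return the children of the first Result element as a plain dict."""
    root = ET.fromstring(xmlstr)
    result_tag = "{%s}Result" % namesp if namesp else "Result"
    result = root.find(result_tag)
    if result is None:
        return {}
    return {local_name(child.tag): child.text for child in result}


fields = parse_result(SAMPLE, "urn:yahoo:maps")
print(fields)
```

Returning a dict instead of writing into self.__dict__ keeps response fields from silently overwriting attributes of the object; whether that trade-off suits restconnect is a design choice left to the reader.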


Questions?