[Cache version from http://mike.hostetlerhome.com/present_files/pyxml.html]
Presented on Sept 5, 2006 to the Omaha Dynamic Languages Group
This work is licensed under a Creative Commons Attribution-Share Alike 2.5 License.
From The Python Library Reference
"xml.dom.minidom is a light-weight implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also significantly smaller."
From http://effbot.org/zone/element-index.htm
"The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. The element type can be described as a cross between a Python list and a Python dictionary. The ElementTree wrapper adds code to load XML files as trees of Element objects, and save them back again."
xml.etree
From http://effbot.org/zone/celementtree.htm
"The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory. On modern hardware, that means that documents in the 50-100 megabyte range can be manipulated in memory, and that documents in the 0-1 megabyte range load in zero time (0.0 seconds). This allows you to drastically simplify many kinds of XML applications."
Other libraries you may see mentioned on the Web:
In the following examples, the minidom code is on the top and the cElementTree code is on the bottom.
(at least how I do it)
# minidom
from xml.dom import minidom
# cElementTree
import cElementTree as ET

# minidom
roottag="<tag/>"
newdoc=minidom.parseString(roottag)
# cElementTree
etElement=ET.Element("tag")

# minidom
newtag = newdoc.createElement("newtag")
newdoc.documentElement.appendChild(newtag)
# cElementTree
newElement=ET.SubElement(etElement,"newtag")

# minidom
newtag.setAttribute("name","value")
# cElementTree
newElement.set('name','value')

# minidom
newtag.appendChild(newdoc.createTextNode("text value"))
# cElementTree
newElement.text="text value"

# minidom
newdoc2=minidom.parseString(roottag)
newtag=newdoc2.childNodes[0]
newtag2=newdoc.importNode(newtag,deep=1)
newtag2 = newdoc.documentElement.appendChild(newtag2)
# cElementTree
newElement=ET.Element("tag")
etElement.append(newElement)

# minidom
newtag2.parentNode.removeChild(newtag2)
# cElementTree
etElement.remove(newElement)
# minidom
for tag in newdoc.getElementsByTagName("newtag"):
    print tag.getAttribute("name")
# cElementTree (findall returns every match; find returns only the first Element)
for tag in etElement.findall("newtag"):
    print tag.get("name")
# minidom
print newdoc.toxml()
# cElementTree
print ET.tostring(etElement)
<?xml version="1.0" ?> <tag><newtag name="value">text value</newtag></tag>
<tag><newtag name="value">text value</newtag></tag>
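For readers on Python 3, the same walk-through can be sketched in one runnable block. This is an adaptation, not the code from the slides: it assumes the standard-library `xml.etree.ElementTree` module in place of the separate `cElementTree` package, and it serializes the minidom root element rather than the whole document so the two outputs can be compared directly.

```python
# Python 3 sketch: the ElementTree API ships in the standard library
# as xml.etree.ElementTree.
from xml.dom import minidom
import xml.etree.ElementTree as ET

# --- minidom: build <tag><newtag name="value">text value</newtag></tag> ---
newdoc = minidom.parseString("<tag/>")
newtag = newdoc.createElement("newtag")
newdoc.documentElement.appendChild(newtag)
newtag.setAttribute("name", "value")
newtag.appendChild(newdoc.createTextNode("text value"))
dom_names = [t.getAttribute("name")
             for t in newdoc.getElementsByTagName("newtag")]
# Serialize the root element only, which leaves out the XML declaration.
dom_xml = newdoc.documentElement.toxml()

# --- ElementTree: build the same tree ---
root = ET.Element("tag")
child = ET.SubElement(root, "newtag")
child.set("name", "value")
child.text = "text value"
et_names = [t.get("name") for t in root.findall("newtag")]
# encoding="unicode" asks for a str instead of bytes.
et_xml = ET.tostring(root, encoding="unicode")

print(dom_xml)
print(et_xml)
```

Both libraries end up with the same markup; the difference is how many calls it takes to get there.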
ElementTree supports a basic subset of XPath for locating Elements in the tree.
The find method returns the first Element that matches the XPath query.
The findall method returns all the Elements that match the query.
The findtext method finds the first Element that matches the XPath query and returns its text.
print et.find("/wpt/desc").text
for t in et.findall("/wpt/desc"):
    print t.text
print et.findtext("/wpt/desc")
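A minimal, self-contained sketch of the three lookups, assuming Python 3 and the standard-library `xml.etree.ElementTree`. The sample document is made up for illustration, and the paths are written relative to the root Element (no leading slash), because the Python 3 module rejects absolute paths on an Element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: a tiny GPX-like document without namespaces.
SAMPLE = """<gpx>
  <wpt lat="1.0" lon="2.0"><desc>first</desc></wpt>
  <wpt lat="3.0" lon="4.0"><desc>second</desc></wpt>
</gpx>"""

root = ET.fromstring(SAMPLE)

first = root.find("wpt/desc")           # first matching Element, or None
all_descs = root.findall("wpt/desc")    # list of every matching Element
first_text = root.findtext("wpt/desc")  # text of the first match
# findtext accepts a default to return when nothing matches.
missing = root.findtext("wpt/nope", default="n/a")

print(first.text)
print([d.text for d in all_descs])
print(first_text, missing)
```

Note that find returns None when nothing matches, so `root.find(...).text` raises an AttributeError on a miss; findtext with a default avoids that.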
Namespaces are special in ElementTree. If the document defines namespaces, you must always include the namespace in the tag name, even if it is the default namespace, i.e. a namespace declared as:
<gpx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.topografix.com/GPX/1/0" xsi:schemaLocation="http://www.topografix.com/GPX/1/0">
Here is a sample of how you have to access it:
>>> et=ET.parse("everything.gpx")
>>> et.find("wpt")
>>> print et.find("wpt")
None
>>> print et.find("{http://www.topografix.com/GPX/1/0}wpt")
<Element '{http://www.topografix.com/GPX/1/0}wpt' at 0xe3f8>
I like to use string templates for this. See next example.
GPX is the XML format that GPS units export. Here is a script that
reads a GPX file and prints lat,lon,wayptname for each waypoint.
import sys,os
import cElementTree as ET
import string

if __name__ == '__main__':
    mainNS=string.Template("{http://www.topografix.com/GPX/1/0}$tag")
    wptTag=mainNS.substitute(tag="wpt")
    nameTag=mainNS.substitute(tag="name")

    et=ET.parse(open("everything.gpx"))

    for wpt in et.findall("//"+wptTag):
        wptinfo=[]
        wptinfo.append(wpt.get("lat"))
        wptinfo.append(wpt.get("lon"))
        wptinfo.append(wpt.findtext(nameTag))
        print ",".join(wptinfo)
39.655717000,-104.902083000,GCRQRZ
39.568783000,-104.913300000,GCRRCK
39.556767000,-104.874400000,GCRRHG
39.660650000,-104.762467000,GCRRQ5
39.664640000,-104.764720000,GCRWHG
39.572367000,-104.912567000,GCRWV5
39.705883000,-104.778600000,GCRZ5V
39.709617000,-104.786800000,GCRZ5X
39.566450000,-104.889233000,GCT2BP
....
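The namespace handling in the script above can be tried without a GPX file on hand. The following is a sketch under a few assumptions: Python 3 with the standard-library `xml.etree.ElementTree`, an inline sample that stands in for everything.gpx, and a second lookup style (a prefix map passed as `namespaces`) that the slides do not use. The `g` prefix is an arbitrary local name.

```python
import string
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/0"

# Hypothetical sample standing in for everything.gpx.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/0">
  <wpt lat="39.655717" lon="-104.902083"><name>GCRQRZ</name></wpt>
  <wpt lat="39.568783" lon="-104.913300"><name>GCRRCK</name></wpt>
</gpx>"""

root = ET.fromstring(SAMPLE)

# A bare tag name does not match an element in the default namespace.
bare = root.find("wpt")

# Approach 1: build "{uri}tag" names with a string template, as in the script.
ns_tag = string.Template("{%s}$tag" % GPX_NS)
wpt_tag = ns_tag.substitute(tag="wpt")
name_tag = ns_tag.substitute(tag="name")
rows = [",".join([w.get("lat"), w.get("lon"), w.findtext(name_tag)])
        for w in root.findall(wpt_tag)]

# Approach 2: pass a prefix map and write "prefix:tag" in the path.
ns = {"g": GPX_NS}
names = [w.findtext("g:name", namespaces=ns)
         for w in root.findall("g:wpt", ns)]

for row in rows:
    print(row)
```

The template keeps the long URI in one place; the prefix map does the same job with shorter paths when many different tags are queried.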
restconnect is a module I wrote to abstract a REST web service API. Instead of creating the XML yourself, the RestConnect class creates the XML from properties, and assigns the result to another property. Here is an example:
### create the Geocode class
class Geocode(RestConnect):
    def __init__(self):
        RestConnect.__init__(self,
            "http://api.local.yahoo.com/MapsService/V1/geocode?",
            "urn:yahoo:maps")
        self.appid='xxxxx'

## Use Geocode
if __name__=='__main__':
    g= Geocode()
    if len(sys.argv)<2:
        g.city="Omaha"
        g.State="NE"
        g.Street="14620 Frances Cir"
        g.zip="68144"
    else:
        g.location=''
        for x in sys.argv[1:]:
            g.location+="%s " %x
    g.fetch()
    print g.Latitude,g.Longitude
def _parse(self,xmlstr):
    dom = minidom.parseString(xmlstr)
    result = dom.getElementsByTagName("Result")[0]
    for child in result.childNodes:
        if child.firstChild:
            self.__dict__[child.tagName] = child.firstChild.data
    del dom
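def _parse(self,xmlstr):
    # xmlstr is a string, so use fromstring; ET.parse expects a filename or file object
    et = ET.fromstring(xmlstr)
    if self._namesp:
        namesp=string.Template("{%s}$tag" %self._namesp)
    else:
        # no namespace: the template is just the bare tag name
        namesp=string.Template("$tag")
    resultTag=namesp.substitute(tag="Result")
    result=et.find(resultTag)
    for child in list(result):
        if child.tag.find("}")>-1:
            # strip the leading "{namespace}" from the tag
            tagname=child.tag[child.tag.find("}")+1:]
        else:
            tagname=child.tag
        self.__dict__[tagname] = child.text
    del et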
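The ElementTree version of `_parse` can be sketched as a standalone function for experimenting outside the RestConnect class. Everything named here is an assumption for illustration: `local_name` and `result_fields` are hypothetical helpers, the sample response is made up in the shape the code above implies (it is not real service output), and the sketch targets Python 3's `xml.etree.ElementTree`. It returns a dictionary instead of writing into `self.__dict__`.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop a leading "{namespace}" from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def result_fields(xmlstr, namespace=None):
    """Return {local tag name: text} for the children of the first Result."""
    root = ET.fromstring(xmlstr)
    prefix = "{%s}" % namespace if namespace else ""
    result = root.find(prefix + "Result")
    if result is None:
        return {}
    return {local_name(child.tag): child.text for child in result}

# Made-up response for illustration only; not real service output.
SAMPLE = """<ResultSet xmlns="urn:yahoo:maps">
  <Result><Latitude>41.0</Latitude><Longitude>-96.0</Longitude></Result>
</ResultSet>"""

fields = result_fields(SAMPLE, "urn:yahoo:maps")
print(fields["Latitude"], fields["Longitude"])
```

Returning a plain dictionary keeps the parsing step testable on its own; a class like RestConnect could still copy the entries onto its attributes afterwards.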