[This local archive copy is from the official and canonical URL, http://www.darmstadt.gmd.de/~inim/perl_xml_survey.html; please refer to the canonical source document if possible.]


Ways to Rome: Processing XML with Perl

by Ingo Macherius, <macherius@gmd.de>
Version: $VERSION: Di 23.02.1999 8:02:20,12$

Legal Disclaimer: This paper, especially the source code included, is copyrighted material. © 1999 by GMD-IPSI. You are free to use and copy it for personal education, but not to publicly display the article or parts of it in other media (www, print, ...) or distribute it with commercial software. The use of code examples is at your own risk.


One of Perl's key features is that things can be done in more than one way. This also holds for processing XML with Perl. This brief tutorial solves one simple task again and again, using different XML-related CPAN modules and programming styles. The candidates are:

Software                  Version   Source                                                    Remark
Perl regular expressions  5.005_02
XML::Parser               2.19      http://wwwx.netheaven.com/~coopercc/xmlparser/intro.html  using Handlers
XML::Parser               2.19      http://wwwx.netheaven.com/~coopercc/xmlparser/intro.html  using Subs-style
XML::DOM                  1.15      http://www.erols.com/enno/dom/                            using standard methods only
XML::XQL                  0.59      http://www.erols.com/enno/xql/                            older versions won't work

This tutorial was written for a talk at the German Perl Workshop 1.0 on February 17th, 1999 in Bonn. The focus is on the code examples, not the explanatory text. All code is tested and should work "cut-and-paste" if you have the above modules installed and a copy of REC-XML in your working folder.

1 The Task

The task is to filter the REC-xml-19980210.xml specification for grammar productions. They are contained in special markup; a typical one looks like this:

...
<prod id="NT-PubidLiteral"><lhs>PubidLiteral</lhs>
<rhs>'"' <nt def='NT-PubidChar'>PubidChar</nt>* 
'"' 
| "'" (<nt def='NT-PubidChar'>PubidChar</nt> - "'")* "'"</rhs>
</prod>
...

So a grammar rule consists of a production, which in turn consists of a left-hand side and a right-hand side. The right-hand side may contain markup describing hyperlinks between productions, so it is mixed content. The DTD fragment for this is:

<!ELEMENT prod (lhs, (rhs, (com|wfc|vc)*)+)>
<!--    ID attribute:
        The production must have an ID so that cross-references
        (specref) and mentions of nonterminals (nt) can link to
        it. -->
<!ATTLIST prod
        %common-idreq.att;>

<!ELEMENT lhs (#PCDATA)>
<!ATTLIST lhs %common.att;>

<!ELEMENT rhs (#PCDATA|nt|xnt|com)*>
<!ATTLIST rhs %common.att;>

I'm not aware of a place where spec.dtd can be officially downloaded. It is, however, contained in the archive of the first release of XML::Parser on Larry Wall's server. As XML processing in Perl is not DTD-aware, this will not hurt. Please note that the real specification document uses only <nt>; <xnt> and <com> do not occur and are not processed by the examples below.

The example programs will all produce a standard EBNF representation of the 89 productions contained in the XML specification, like this:

...
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
...

So, here we go with the code ...

2.1 Perl regular expressions

As the name "Practical Extraction and Report Language" implies, Perl was used for data extraction long before the advent of XML. The tool of choice for this task is the regular expression. The code below just follows this well-known trail: it slurps in the file and applies regular expressions until the grammar text is extracted. Please note that the code heavily exploits the "non-greedy" option for regular expressions.

This is a non-example. Don't really use regular expressions on XML; in the long run you will be bitten.

# Using Perl regular expressions
# (c) 1999 GMD-IPSI, all rights reserved
# Author: Ingo Macherius <macherius@gmd.de>
require IO::File;

my $fh = new IO::File("REC-xml-19980210.xml", "r");
while (!$fh->eof()) { $doc .= $fh->getline; }

foreach( $doc =~ m@<prod.*?>.*?</prod>@gsc ) {
	push @prods, $_;
}

$counter = 0;
foreach ( @prods ) {
	s/\xc2\xa0/ /sg;
	s/[\r\n]/ /sg;
	
	m@<lhs>(.*?)</lhs>@;
	$lhs = $1;
	$lhs =~ s/\s\s/ /sg;

	m@<rhs>(.*?)</rhs>@;
	$rhs = $1;
	$rhs =~ s@<nt.*?>@@sg;
	$rhs =~ s@</nt.*?>@@sg;
	$rhs =~ s/\s\s/ /sg;
	print "[" . ++$counter . "] "
		  . $lhs
		  . " ::= "
		  . $rhs
		  . "\n";	
}
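
To illustrate the warning at the start of this section, here is a minimal sketch of how the non-greedy pattern can be fooled. The input is made up for the purpose and is not taken from REC-xml; the sub name extract_lhs is hypothetical. One production sits inside an XML comment, which a parser would never report as an element, yet the regular expression picks it up.

```perl
use strict;
use warnings;

# Hypothetical input (NOT from REC-xml): the first production is commented out.
my $doc = <<'XML';
<!-- <prod id="old"><lhs>Ghost</lhs><rhs>'x'</rhs></prod> -->
<prod id="NT-A"><lhs>A</lhs><rhs>'a'</rhs></prod>
XML

# Same idea as the listing above: non-greedy match from <prod ...> to </prod>,
# then pull the left-hand side out of each match.
sub extract_lhs {
    my ($text) = @_;
    my @names;
    while ($text =~ m@<prod.*?>(.*?)</prod>@gs) {
        my $body = $1;
        if ($body =~ m@<lhs>(.*?)</lhs>@s) {
            push @names, $1;
        }
    }
    return @names;
}

print join(", ", extract_lhs($doc)), "\n";   # prints "Ghost, A"
```

The pattern has no notion of comments, so the commented-out "Ghost" production is reported alongside the real one. The parser-based solutions below do not have this problem.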

2.2 XML::Parser and Handlers

This solution uses XML::Parser in its simplest possible form, with handlers. This is very close to the original Expat API, so the code should be (and indeed is) fast, even faster than the regular expressions in solution 2.1. This is because we are actually processing an event stream: no in-memory representation of the XML document or of the results is ever built. The control flow in event-based programming is weird and repetitive, so this API is not suited for the casual programmer. People used to thinking in state machines and automata, however, will be very happy using handlers.

# Using XML::Parser and Handlers
# (c) 1999 GMD-IPSI, all rights reserved
# Author: Ingo Macherius <macherius@gmd.de>
use XML::Parser;
use strict;

my $parser = new XML::Parser(Handlers => 
	{ Start => \&tag_start,
	  End   => \&tag_end,
	  Char  => \&characters,
	  Init  => \&init
	});

$parser->parsefile('REC-xml-19980210.xml');

sub tag_start {
	my ($xp, $el) = @_;

	if ($el eq 'rhs') {
		$main::in_rhs = 1;
	} 
	elsif ($el eq 'lhs') {
		$main::in_lhs = 1;
	}
	elsif ($el eq 'prod') {
		$main::rhs = undef;
		$main::lhs = undef;
	}
}

sub tag_end {
	my ($xp, $el) = @_;
	if ($el eq 'rhs') {
		$main::in_rhs = undef;
	} 
	elsif ($el eq 'lhs') {
		$main::in_lhs = undef;
	}
	elsif ($el eq 'prod') {
		print_production();
	}
}

sub characters {
	my ($xp, $txt) = @_;
	if ($main::in_lhs) {
		$main::lhs .= $txt
	}
	elsif ($main::in_rhs) {
		$main::rhs .= $txt
	}
}

sub init {
	$main::counter = 0;
}

sub print_production {
	my $prod =
	"[" . ++$main::counter . "] "
	. $main::lhs
	. " ::= "
	. $main::rhs
	;
	$prod =~ s/\xc2\xa0/ /sg;
	$prod =~ s/[\r\n]/ /sg;
	$prod =~ s/\s\s/ /sg;
	print $prod . "\n";
}
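
A note on the clean-up substitutions in print_production: s/\s\s/ /sg replaces each pair of whitespace characters with one blank, so a longer run is shortened rather than collapsed to a single blank. The sketch below contrasts this with s/\s+/ /g. It assumes that a single blank is the desired result, which the original text does not state; the sub names collapse_pairs and collapse_runs are hypothetical and not part of the listings.

```perl
use strict;
use warnings;

# Replace each PAIR of whitespace characters with one blank (as in the listings).
sub collapse_pairs {
    my ($s) = @_;
    $s =~ s/\s\s/ /g;
    return $s;
}

# Assumed alternative: replace each RUN of whitespace with one blank.
sub collapse_runs {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    return $s;
}

# Sample text: a line break followed by four blanks of indentation.
my $sample = "PubidChar*\n    '\"'";
(my $flat = $sample) =~ s/[\r\n]/ /g;   # line breaks become blanks first

print "[", collapse_pairs($flat), "]\n";
print "[", collapse_runs($flat),  "]\n";
```

With the pairwise version the five blanks in the flattened sample shrink to three; with the run-based version they shrink to one. Which behaviour is wanted depends on how closely the output should follow the layout of the source document.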

2.3 XML::Parser and the Subs-style

The "Subs"-style of XML::Parser is syntactic sugar for the bare handler interface. It auto-generates handlers for element events and maps them to Perl subroutines. So the logic of example 2.2 is used unchanged, but as the control flow is managed by the module, all those "if" statements have vanished. Characters must still be handled by your own handler.

# Using XML::Parser and the Subs-style
# (c) 1999 GMD-IPSI, all rights reserved
# Author: Ingo Macherius <macherius@gmd.de>
use XML::Parser;
use strict;

my $parser = new XML::Parser('Style' => 'Subs' );
   $parser->setHandlers('Char', \&characters);

$main::counter = 0;

$parser->parsefile('REC-xml-19980210.xml');

sub rhs   { $main::in_rhs = 1 }
sub lhs   { $main::in_lhs = 1 }
sub prod  { $main::rhs = undef;
		    $main::lhs = undef;
		  }
sub rhs_  { $main::in_rhs = undef }
sub lhs_  { $main::in_lhs = undef }
sub prod_ { print_production(); }

sub characters {
	my ($xp, $txt) = @_;
	if ($main::in_lhs) {
		$main::lhs .= $txt
	}
	elsif ($main::in_rhs) {
		$main::rhs .= $txt
	}
}

sub print_production {
	my $prod =
	"[" . ++$main::counter . "] "
	. $main::lhs
	. " ::= "
	. $main::rhs
	;
	$prod =~ s/\xc2\xa0/ /sg;
	$prod =~ s/[\r\n]/ /sg;
	$prod =~ s/\s\s/ /sg;
	print $prod . "\n";
}

2.4 XML::DOM

The DOM is the W3C's standard API for XML. The Perl implementation (well, any implementation of the DOM I know, to be exact) offers a bunch of convenience methods. In the example below none of them are used, just to keep things clean. The striking impression when using XML::DOM is that the programming style is very "unperlish". Length-based for-loops have to be used instead of Perl's convenient foreach iterator. Clumsy nodeLists are necessary where native Perl lists would do a great job. This is the price of interoperability, I presume.

As XML::DOM has to build an in-memory representation of the document, there is a huge impact on execution time. The example takes about 8 times longer than the slowest event-based approach. There may be optimizations to reduce this, but building a DOM will always cost.

Another striking point is the repetitive use of iteration over a node's children. This is a well-known "feature" of DOM code, which will hopefully go away in DOM Level 2.

# Using XML::DOM
# (c) 1999 GMD-IPSI, all rights reserved
# Author: Ingo Macherius <macherius@gmd.de>
require XML::DOM;
use strict;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("REC-xml-19980210.xml");

my $nodes = $doc->getElementsByTagName ("prod");

for (my $i = 0; $i < $nodes->getLength(); $i++) {
	 my $node = $nodes->item ($i);
	 my $lhs = $node->getElementsByTagName("lhs")->item(0);
	 my $rhs = $node->getElementsByTagName("rhs")->item(0);
	 
     print 
	   "[" . ($i+1) . "] " 
	   . $lhs->getFirstChild->getNodeValue()
           . " ::= "
           . rhs($rhs)
	   . "\n";
}

sub rhs {
	my $nodes = $_[0]->getChildNodes();
	my $text;
	
	for (my $i = 0; $i < $nodes->getLength(); $i++ ) {
		my $node = $nodes->item ($i);
		if ( $node->getNodeType() == XML::DOM::Node::ELEMENT_NODE()) {
			$text .= $node->getFirstChild()->getNodeValue()
		} else {
			$text .= $node->getNodeValue()
		}
	}
	$text =~ s/\xc2\xa0/ /sg;
	$text =~ s/[\r\n]/ /sg;
	$text =~ s/\s\s/ /sg;
	return $text;
}

2.5 XML::XQL

XML::XQL is the highest-level module available for processing XML in Perl. This results in the shortest code - and the longest execution time. A nice feature of XML::XQL is that all results are XML::DOM nodes, so the results of a query can be processed further using DOM methods. While developing this example I triggered a few bugs in the module, but Enno Derksen fixed them very quickly. So if XML::XQL matures, it can be a very powerful tool for Perl programmers.

An interesting result is that this code is only about 18% slower than the XML::DOM example, going by the timings in section 3. I'd have expected a lot more overhead, especially given the open and extensible way the XML::XQL module works.

# Using XML::XQL
# (c) 1999 GMD-IPSI, all rights reserved
# Author: Ingo Macherius <macherius@gmd.de>
use XML::XQL;
use XML::XQL::DOM;
use strict;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ("REC-xml-19980210.xml");

my @lhs = XML::XQL::solve ('//prod/lhs!text()', $doc);
my @rhs = XML::XQL::solve ('//prod/rhs!text()', $doc);

for (my $i = 0 ; $i <= $#lhs ; $i++ ) {
	print "[" . ($i+1) . "] "
	  . $lhs[$i]->xql_toString()
	  . " ::= "
	  . $rhs[$i]->xql_toString()
	  . "\n";
}

3 Comparison

Here are the timings for running the different versions:

Example                   Timing (DProf)  Factor
Perl regular expressions            1020   137 %
XML::Parser (Handlers)               741   100 %
XML::Parser (Subs)                  1080   145 %
XML::DOM                            8570  1156 %
XML::XQL                           10120  1366 %
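
The Factor column appears to be each timing expressed relative to the fastest run, XML::Parser with handlers. The following sketch recomputes it from the timings in the table; the sub name factor is hypothetical, and the table does not state how its figures were rounded, so the recomputed values can differ from the printed ones by a point.

```perl
use strict;
use warnings;

# Timings copied from the table above (DProf figures as reported there).
my @timings = (
    [ 'Perl regular expressions', 1020  ],
    [ 'XML::Parser (Handlers)',   741   ],
    [ 'XML::Parser (Subs)',       1080  ],
    [ 'XML::DOM',                 8570  ],
    [ 'XML::XQL',                 10120 ],
);

# Timing expressed as a percentage of a baseline timing.
sub factor {
    my ($timing, $base) = @_;
    return 100 * $timing / $base;
}

my $base = 741;   # XML::Parser (Handlers), the fastest run
for my $row (@timings) {
    my ($name, $timing) = @$row;
    printf "%-26s %6d %7.1f %%\n", $name, $timing, factor($timing, $base);
}

# XML::XQL relative to XML::DOM, the comparison made in section 2.5.
printf "XQL vs. DOM: %.1f %%\n", factor(10120, 8570);
```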

Filtering a document is not a very fair task for the tree-based modules XML::DOM and XML::XQL. They are only worth the overhead of tree construction in sufficiently complex tasks, which filtering is not. The ease of XQL may lead to its use in "quick-n-dirty" single-use applications, where execution time does not matter. XML::DOM seems too unperlish to attract a large crowd of Perl hackers.

My personal impression of using Perl with XML is good: the base modules are maturing, and the documentation is very usable. Given Unicode in core Perl, it can continue to be the web hacker's tool of choice in open source software.


© 1999 by GMD-IPSI and Ingo Macherius, all rights reserved.