[This local archive copy is from the official and canonical URL, http://www.perlxml.com/faq/perl-xml-faq.html; please refer to the canonical source document if possible.]
by Jonathan Eisenzopf
This FAQ was generated using a small Perl script and an XML file. The script can be slurped from http://www.perlxml.com/faq/xmlfaq.pl. The XML source is located at http://www.perlxml.com/faq/perl-xml-faq.xml. To generate the Perl XML FAQ, run perl xmlfaq.pl perl-xml-faq.xml which prints the HTML to STDOUT.
Yes, there are several, but the most popular one is the XML::Parser module. Originally developed by Larry Wall, Clark Cooper now maintains the XML::Parser module. The module is a Perl wrapper around Expat, a non-validating parser written in C by James Clark. The module can be found on any CPAN server or on Clark's home page at http://www.netheaven.com/~coopercc/xmlparser/intro.html. The distribution includes Expat, so you don't have to worry about installing it separately. More information on Expat is available at http://www.jclark.com/xml/expat.html. Clark Cooper has also written a nice intro to XML::Parser which is available at http://www.xml.com/xml/pub/98/09/xml-perl.html.
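As a rough illustration of the event-driven interface, the sketch below registers a Start handler and records each element name as the parser encounters it. The sample document and the @seen array are made up for this example; only XML::Parser->new, the Handlers option, and parse come from the module.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, used only for illustration.
my $xml = '<faq><q id="1">Is there a parser?</q><q id="2">Which one?</q></faq>';

my @seen;    # element names, in document order

my $p = XML::Parser->new(
    Handlers => {
        # Called for every start tag with ($expat, $element_name, %attributes).
        Start => sub {
            my ($expat, $name, %attr) = @_;
            push @seen, $name;
        },
    },
);

$p->parse($xml);
print join(', ', @seen), "\n";
```

Other handlers (End, Char, and so on) are registered the same way; see the XML::Parser pod for the full list.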
In some cases, you may want to utilize regular expressions to manipulate XML. REX, written by Rob Cameron, is a fairly complete shallow parser written in Perl. Information on REX can be found at http://www.cs.sfu.ca/~cameron/REX.html.
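To give a feel for what shallow parsing means, here is a deliberately simplified tokenizer. It is not the REX expression, which is far more thorough; the shallow_tokens name and the pattern are assumptions made for this sketch. It only splits a string into markup and text and does no well-formedness checking.

```perl
use strict;
use warnings;

# A simplified shallow tokenizer (NOT the REX expression).
# Returns comments, tags, and runs of text as separate tokens.
# Call it in list context.
sub shallow_tokens {
    my ($xml) = @_;
    return $xml =~ /(<!--.*?-->|<[^>]*>|[^<]+)/gs;
}

my @tokens = shallow_tokens('<p>Hello <b>world</b></p>');
print "$_\n" for @tokens;
```

This sketch breaks on input such as a '>' inside an attribute value or a CDATA section, which is exactly the kind of case REX is designed to handle.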
XML::DOM is an implementation of the World Wide Web Consortium's (W3C) Document Object Model (DOM) Level 1. DOM defines an interface for working with an XML tree, and XML::DOM implements that interface on top of an in-memory tree of XML nodes. All DOM implementations, in Perl or other languages, use a similar interface, so code written against one DOM implementation should work with others. This portability allows you to pick a DOM implementation that has the features you need (memory usage, implementation language, database, etc.).
XML::Grove uses Perl hashes and arrays to store XML objects allowing you to use regular Perl hash and array functions to work with the tree of XML nodes. XML::Grove is based on ``property sets'' as described in the International Organization for Standardization (ISO) HyTime and DSSSL standards.
Using XML::DOM will allow you to more easily port your code from or to other languages or use other DOM modules [XML::DOM is the only implementation currently available to be used from Perl]. XML::Grove has a simpler Perlish interface. Briefly reading the XML::DOM and XML::Grove pod documentation may help you choose which module to use. Many modules work with both DOM and groves[*], but you should check the module documentation for compatibility issues.
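For comparison, a minimal sketch of the DOM style of access using XML::DOM follows. The sample document and variable names are invented for illustration; the method names (getElementsByTagName, getLength, item, getAttribute) are the standard DOM Level 1 names that XML::DOM exposes.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical sample document.
my $xml = '<faq><q id="1">First</q><q id="2">Second</q></faq>';

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse($xml);

# Collect the id attribute of every <q> element.
my @ids;
my $nodes = $doc->getElementsByTagName('q');
for my $i (0 .. $nodes->getLength - 1) {
    push @ids, $nodes->item($i)->getAttribute('id');
}
print "ids: @ids\n";

# XML::DOM trees contain circular references, so free them explicitly.
$doc->dispose;
```

Because the calls follow the DOM specification, the loop would look much the same in another language's DOM binding.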
The XML declaration in your XML file is incorrect. Contrary to a common belief, XML is case sensitive.
The declaration must be lowercase and must include the version attribute (also lowercase). It should look like this:
<?xml version='1.0'?>
You may also specify the character encoding and declare whether the document is standalone:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
NOTE: You can use single or double quotes for attribute values in the XML declaration.
This error usually occurs when using XML::Parser in conjunction with DBI. This is not a bug in XML::Parser or DBI, but a bug in Perl itself. You should upgrade to DBI version 1.05 or greater, or simply load the FileHandle module like this:
use FileHandle;
Normally, the XML::Parser module will immediately terminate when it finds malformed XML. This is, in fact, the way XML parsers should behave. There are cases, however, where you may want to handle the error without exiting the program. In these cases, you can enclose the code that calls the parse() or parsefile() methods in an eval block like:
eval { $p->parse($xml) };
or like:
eval { $p->parsefile($filename) };
If an error occurs, eval puts the error message into the $@ variable. Below is a short script that parses an XML file. It encloses the parsefile() method in an eval block and then prints the error message if an error occurred.
use strict;
use XML::Parser;

my $p = new XML::Parser();
die "catch_error.pl\n" unless $ARGV[0] && -e $ARGV[0];

eval { $p->parsefile($ARGV[0]) };
print "Caught error: $@\n" if $@;
print "Done.\n";
Yes, the original_string method, which is available in version 2.19 or later, returns strings in their original encoding. The only drawback is that it will disable entity expansion. Also, you cannot use this method if you are using the XML::Parser::ExpatNB object, which was added in version 2.22.
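A small sketch of calling original_string from inside a handler is shown below. The sample document and the @raw array are assumptions for the example; the method is called on the Expat object that XML::Parser passes as the first argument to every handler.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document.
my $xml = q{<note lang='en'>hi</note>};

my @raw;    # verbatim source text for each start tag

my $p = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat) = @_;
            # The text from the document that triggered this handler call.
            push @raw, $expat->original_string;
        },
    },
);

$p->parse($xml);
print "$_\n" for @raw;
```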
You can read multiple documents from a stream by using the parse_start method in place of parse or parsefile. The parse_start method creates a new instance of XML::Parser::ExpatNB. Multiple documents are parsed by making successive calls to the parse_more method. Calling the parse_done method signifies that you are done processing the document.
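The call sequence can be sketched as follows. For simplicity this example feeds a single document to the parser in pieces; the chunk contents and the $count variable are invented for illustration, and in practice the chunks would come from a socket or pipe.

```perl
use strict;
use warnings;
use XML::Parser;

my $count = 0;
my $p = XML::Parser->new(
    Handlers => { Start => sub { $count++ } },
);

# Hypothetical chunks; note that a tag may be split across two of them.
my @chunks = ('<log><entry>a</en', 'try><entry>b</entry>', '</log>');

my $nb = $p->parse_start;           # returns an XML::Parser::ExpatNB object
$nb->parse_more($_) for @chunks;    # feed data as it becomes available
$nb->parse_done;                    # no more data will follow

print "start tags seen: $count\n";
```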
You can filter out the whitespace in your text handler:
sub text {
    my ($xp, $data) = @_;
    return if ($ignorable_whitespace{$xp->current_element} and $data =~ /^\s*$/m);
    # Rest of processing ...
}
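A complete, runnable version of this idea might look like the sketch below. The contents of %ignorable_whitespace, the sample document, and the @kept array are assumptions for the example; the handler is registered as the Char handler, and the hash lists the elements in which whitespace-only text should be dropped.

```perl
use strict;
use warnings;
use XML::Parser;

# Elements whose whitespace-only text is not significant (hypothetical).
my %ignorable_whitespace = (list => 1);
my @kept;    # text that survived the filter

sub text {
    my ($xp, $data) = @_;
    # Skip whitespace-only text directly inside an "ignorable" element.
    return if $ignorable_whitespace{ $xp->current_element } and $data =~ /^\s*$/;
    push @kept, $data;
}

my $p = XML::Parser->new(Handlers => { Char => \&text });
$p->parse("<list>\n  <item>one</item>\n  <item>two</item>\n</list>");
print join('|', @kept), "\n";
```

Whitespace inside item elements is left alone, since only list is marked as ignorable.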
This is a bug in version 2.20 of the XML::Parser module. Try upgrading to a newer version.
If you're using the Perl distribution that came with Red Hat Linux 5.2, you will want to upgrade to a newer version of Perl. Red Hat accidentally included a buggy version in that release.
Yes, Eric Prud'hommeaux has developed the W3C::Rdf::RdfParser which relies on the Perl implementation of SAX mentioned in question #7. It's available at http://www.w3.org/1999/02/26-modules/.
For starters, if you are using Apache, you should probably install mod_perl, available at http://perl.apache.org. This will eliminate the time it normally takes to load the Perl interpreter and any modules you are using. If you still require speed, you might consider using Data::Dumper or Storable to dump the XML::Parser object to disk. This eliminates the time required to re-parse an XML document.
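One way to read the caching suggestion is to store the parsed tree, rather than the parser itself, with Storable. The sketch below takes that interpretation as an assumption; the cache path, the load_tree helper, and the sample document are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

my $dir   = tempdir(CLEANUP => 1);
my $cache = "$dir/faq.tree";               # hypothetical cache location
my $xml   = '<faq><q>Is it fast?</q></faq>';

# Return the parse tree, reading it from the cache when one exists.
sub load_tree {
    return retrieve($cache) if -e $cache;  # cache hit: skip parsing
    my $tree = XML::Parser->new(Style => 'Tree')->parse($xml);
    store($tree, $cache);                  # cache miss: parse, then save
    return $tree;
}

my $first  = load_tree();    # parses the document and writes the cache
my $second = load_tree();    # reads the tree back from disk
print "root element: $second->[0]\n";
```

A real cache would also compare the modification times of the XML file and the cache file, so that a changed document is parsed again.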
Copyright (c)1998 Jonathan Eisenzopf. All rights reserved. Permission is hereby granted to freely distribute this document provided that all credits and copyright notices are retained.