Perl and XML at XML Dev Day

From Mon Mar 30 10:05:10 1998
Date: Mon, 30 Mar 1998 16:24:20 +0100
From: Andy Wardley <>
Subject: Supplied: XML Dev day Perl report

[Note: see also XML and Perl in the XML database.]

[ also CC'd to perl-unicode who may be interested ]

On Mar 28, 11:07am, Ken MacLeod wrote:
> I'd love to hear some more reports on the Perl and XML session from
> the XML Developer's Day.  I saw Tim Bray's quick note on the XML-Dev
> list.

I made a few notes.  Here's a brief summary.

Larry first acknowledged that "Perl <heart> XML" was, in fact, incorrect
XML syntax, and he attributed this to the fact that he knew very little
about XML a month ago.  But he's learning....

He talked about the work required to get Perl working natively with
Unicode, covering:

  * lexically scoped unicode awareness
  * positional commands (e.g. substr) which currently assume char = byte
  * regexen
  * translation (this point escaped any more note-taking but I think it
    relates to case folding, etc)
  * character classes
  * pack & unpack
  * IO filters (explicit or implicit)
  * Unicode Perl programs?

...and said that UTF-8 can be used currently until Perl directly supports
native wide characters.
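
The "char = byte" assumption behind positional commands such as substr
can be seen with any UTF-8 string.  What follows is only a minimal sketch:
it assumes the Encode module that ships with much later Perl releases
(nothing like it is described in the talk), and the sample string is made
up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);   # assumption: a Perl recent enough to bundle Encode

# "\x{263A}" is WHITE SMILING FACE: one character, three bytes in UTF-8.
my $chars = "Perl \x{263A} XML";        # character string, 10 characters
my $bytes = encode('UTF-8', $chars);    # same text as UTF-8, 12 bytes

printf "characters: %d\n", length $chars;
printf "bytes:      %d\n", length $bytes;

# substr counts whatever units the string holds.
my $cut_chars = substr($chars, 0, 6);   # "Perl " plus the whole smiley
my $cut_bytes = substr($bytes, 0, 6);   # "Perl " plus only 1 of the smiley's 3 bytes

printf "six characters re-encode to %d bytes\n", length encode('UTF-8', $cut_chars);
printf "six bytes stay %d bytes, ending mid-character\n", length $cut_bytes;
```

The two substr calls take the same arguments but cut at different places,
which is the kind of difference a lexically scoped Unicode mode would have
to control.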

He said that a lot of work had been done in parsing XML docs and mentioned
that the 1.0 spec almost always talks about reading, rather than writing.
It was, however, the writing of XML docs that he's been looking at in
particular.  As already reported on this mailing list, there are all sorts
of cases where an XML doc is in an illegal state while you're generating
or manipulating it.  Larry likens this to proteins that manipulate your
DNA - they have to break the rules to get the job done.  Larry can explain
this far better than I can.

The parser code that he's been working on is based around James Clark's
"expat".  He described it as SAX oriented but not yet compliant.  The
parser has a number of different "Styles" which are effectively the
different compiler backends used to produce the parsed XML tree in a
number of different formats: object-based, tree-based, debug style, data
dumper style, etc.
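
To give an idea of what choosing a "Style" looks like, here is a sketch
that assumes the XML::Parser module as it was later published on CPAN,
which may differ from the code shown at the session.  The sample document
and the "Demo" package prefix are made up for illustration.

```perl
use strict;
use warnings;

# XML::Parser is a CPAN module rather than core Perl; stop quietly if it is absent.
eval { require XML::Parser; 1 }
    or do { print "XML::Parser is not installed\n"; exit 0 };

my $xml = '<bible lang="en"><book name="Genesis">In the beginning</book></bible>';

# Tree style: nested arrays of [ tag, content ] pairs, where content is
# [ \%attributes, child_tag, child_content, ..., 0, "text", ... ].
my $tree = XML::Parser->new(Style => 'Tree')->parse($xml);
my ($root_tag, $root_content) = @$tree;
print "Tree root: $root_tag, lang=$root_content->[0]{lang}\n";
print "Tree first child: $root_content->[1]\n";

# Objects style: one blessed hash per element, children under the Kids key.
# Pkg sets the package prefix for the generated class names.
my $objects  = XML::Parser->new(Style => 'Objects', Pkg => 'Demo')->parse($xml);
my $root_obj = $objects->[0];
print "Objects root class: ", ref($root_obj), "\n";
print "Objects first child class: ", ref($root_obj->{Kids}[0]), "\n";
```

The same input goes through the same parser both times; only the Style
argument changes the shape of what comes back.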

One existing problem is that expat does not give access to the "raw" XML
data, instead delivering the "cooked" version where XML comments (for
example) have been stripped.  The canonical test of any parser/generator
is to parse a document and then re-generate the same document to check
they match.  At the moment this is not possible, although James Clark
has agreed on a protocol to permit access to the raw XML that the core
parser sees.
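
The round-trip check, and the way a "cooked" parse fails it, can be
sketched as follows.  This assumes the Tree style of the XML::Parser module
as later published on CPAN; the serialize and escape helpers are
hypothetical and written only for this example.

```perl
use strict;
use warnings;

# XML::Parser is a CPAN module rather than core Perl; stop quietly if it is absent.
eval { require XML::Parser; 1 }
    or do { print "XML::Parser is not installed\n"; exit 0 };

my $original = '<doc><!-- a note --><p class="x">text &amp; more</p></doc>';

# Hypothetical helper: re-escape markup characters in text and attribute values.
sub escape {
    my $s = shift;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical helper: turn one Tree-style (tag, content) pair back into XML.
sub serialize {
    my ($tag, $content) = @_;
    return escape($content) if $tag eq '0';      # tag 0 marks a text node
    my ($attrs, @kids) = @$content;
    my $out = "<$tag";
    $out .= sprintf ' %s="%s"', $_, escape($attrs->{$_}) for sort keys %$attrs;
    $out .= '>';
    while (my ($kid_tag, $kid_content) = splice @kids, 0, 2) {
        $out .= serialize($kid_tag, $kid_content);
    }
    return $out . "</$tag>";
}

my $tree        = XML::Parser->new(Style => 'Tree')->parse($original);
my $regenerated = serialize(@$tree);

print "original:    $original\n";
print "regenerated: $regenerated\n";
print $regenerated eq $original ? "round trip OK\n" : "round trip differs\n";
```

The Tree style registers no comment handler, so the comment never reaches
the tree and the regenerated document cannot match the original.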

Larry demo'd some of the code he's been working on.  Much of it was fleeting
so I got only a flavour.  One of the neatest things was:

  use DTD "Bible";

which parses the "Bible" DTD and then allows you to create a new bible
document instance on the fly.

  my $book = new Testament(...);


  my $bible = new XML::Element('Bible', { attr => val, ... }, [ content ]);

And so on.  There are more than two ways to do this.

All XML elements were XML::Element instances but I don't know if he has
any plans to sub-class them into specific element types.
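
For readers who want to picture the constructor call above, here is a
minimal sketch of a class taking the same three arguments (name,
attributes, content).  Sketch::Element and its add and as_string methods
are hypothetical stand-ins, not the class Larry demonstrated.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the element class seen in the demo.
package Sketch::Element;

sub new {
    my ($class, $name, $attrs, $content) = @_;
    return bless {
        name    => $name,
        attrs   => $attrs   || {},
        content => $content || [],
    }, $class;
}

# Append further children (strings or elements) after construction.
sub add {
    my $self = shift;
    push @{ $self->{content} }, @_;
    return $self;
}

sub escape {
    my $s = shift;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Serialize the element and everything below it.
sub as_string {
    my $self = shift;
    my $out  = "<$self->{name}";
    for my $key (sort keys %{ $self->{attrs} }) {
        $out .= sprintf ' %s="%s"', $key, escape($self->{attrs}{$key});
    }
    $out .= '>';
    $out .= ref($_) ? $_->as_string : escape($_) for @{ $self->{content} };
    return $out . "</$self->{name}>";
}

package main;

my $book  = Sketch::Element->new('Book', { name => 'Genesis' }, ['In the beginning']);
my $bible = Sketch::Element->new('Bible', { lang => 'en' });
$bible->add($book);     # the document was incomplete until this point

print $bible->as_string, "\n";
```

Nothing in this sketch validates against a DTD, so the object is free to
pass through incomplete states while it is being assembled.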

That's about all I've got.  I've missed much of the detail, but hopefully
I caught some of the flavour.


. . .with respect to which Larry Wall noted:

"All pretty much correct except that last sentence.  The "new Testament"
is in fact a specific element type, and the Objects style does bless
into specific types.  (It doesn't supply any particular semantics,
or even an @ISA, but this is Perl, so they can be supplied elsewise,
presumably by "use DTD" or equivalent.)
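
[A sketch of the point about blessing and @ISA, assuming the Objects style
of the XML::Parser module as later published on CPAN.  The Demo prefix,
the Demo::Element base class and its children and text methods are all
hypothetical; here the @ISA is supplied by hand rather than by "use DTD".]

```perl
use strict;
use warnings;

# XML::Parser is a CPAN module rather than core Perl; stop quietly if it is absent.
eval { require XML::Parser; 1 }
    or do { print "XML::Parser is not installed\n"; exit 0 };

# Hypothetical base class that gives every element some behaviour.
package Demo::Element;

# Child elements only, skipping text nodes.
sub children { grep { ref($_) ne 'Demo::Characters' } @{ $_[0]{Kids} } }

# All text below this element, concatenated.
sub text {
    join '', map { ref($_) eq 'Demo::Characters' ? $_->{Text} : $_->text }
        @{ $_[0]{Kids} };
}

package main;

# The Objects style blesses into Demo::<tag> but sets no @ISA, so add one here.
{
    no strict 'refs';
    push @{"Demo::${_}::ISA"}, 'Demo::Element' for qw(Bible Testament Book);
}

my $xml = '<Bible><Testament name="New"><Book>Matthew</Book><Book>Mark</Book></Testament></Bible>';
my $doc = XML::Parser->new(Style => 'Objects', Pkg => 'Demo')->parse($xml);

my $bible       = $doc->[0];
my ($testament) = $bible->children;
my $titles      = join ', ', map { $_->text } $testament->children;

print ref($testament), " ($testament->{name}): $titles\n";
```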

"I'll be making my XML::Parser module and the examples available, pretty
much in the form I presented them.  I figure I ought to play with Unicode
for a while, so I should let y'all play with the XML and maybe turn it
into something useful.  :-)

"Anyway, give me an hour or three to scrape it all together.


