[This local archive copy is from the official and canonical URL, http://jfinity.com/muxml/; please refer to the canonical source document if possible.]


MuXML (Version 0.1)

An XML Document Multiplexor


MuXML is a prototype Perl module that implements configurable multiplexing of XML document streams accessed via the LWPng module and parsed using the XML::Parser module. It returns its results using the XML::Grove module. Its primary purpose is to serve as a demonstration of the use of non-blocking design approaches with XML and Perl. If it also ends up being a useful tool then that's gravy :-).

You can find the source code here if you are interested in browsing.

One way to think about the problem domain that MuXML (and the underlying sub-systems) is addressing is as the generalization of stream-oriented document processing to the handling of more than one document stream at a time (let's call them a stream set). It turns out that in a single-threaded environment, you don't even need more than one document in order to need the kinds of approaches used by LWPng and MuXML.

The various streams in a stream set may have very different throughput levels. This means that you may need to throttle the fast streams so that they do not overwhelm the slow ones. Single-stream processing frameworks, like those of SAX-based XML parsers, do not generalize to stream sets. The main reason is that single-stream frameworks do not support flow control except in a cumbersome, roundabout way. MuXML can be viewed as a framework for processing stream sets with explicit flow control built in.

The most obvious use of MuXML is to multiplex record oriented XML document streams. I'll try to provide some demonstration data generated with something like XFlat/XML Convert in the next release.

One possible application of MuXML would be the emulation of the UNIX sort command's merge capability. Let's call this sample application MergeML. You would pass MergeML a list of XML documents, each of which was already sorted. It would output an XML document that interleaved the contents of the input documents based on the same sorting criteria that were used to sort each input document.
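As a rough illustration of the merge that MergeML would perform, here is a minimal sketch in Python (MuXML itself is Perl, and this is not MuXML code). It assumes each input is already sorted on a hypothetical `key` attribute of hypothetical `<rec>` elements, and it parses whole documents from strings rather than incrementally from the network.

```python
import heapq
import xml.etree.ElementTree as ET

def merge_sorted_docs(docs, record_tag="rec", key_attr="key"):
    """Merge already-sorted XML documents into one sorted document.

    record_tag and key_attr are hypothetical names chosen for this sketch.
    """
    # One iterator of record elements per input document.
    streams = [ET.fromstring(doc).iter(record_tag) for doc in docs]
    merged = ET.Element("merged")
    # heapq.merge pulls lazily from each stream, one record at a time.
    for rec in heapq.merge(*streams, key=lambda e: e.get(key_attr)):
        merged.append(rec)
    return merged

docs = [
    '<log><rec key="a"/><rec key="d"/></log>',
    '<log><rec key="b"/><rec key="c"/><rec key="e"/></log>',
]
result = merge_sorted_docs(docs)
merged_keys = [rec.get("key") for rec in result]
print(merged_keys)
```

The merge only ever looks at the head record of each input, which is why the inputs must already be sorted on the same criteria.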

Another sample application would be aggregating information from a distributed logging system. Let's say that you have a site that is replicated across multiple distinct servers. Each server is stand-alone and does its own logging. The servers create a nightly ordered listing of page accesses (in XML for argument's sake :-). Your MuXML application would access the hit logs and process them in parallel. It would output the aggregate access count across all the sites.
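The aggregation step might look something like the following Python sketch. It assumes that each server's log has already been reduced to (page, hits) tuples sorted by page; that shape is an assumption made for illustration, not a MuXML data format.

```python
import heapq
from itertools import groupby

def aggregate_hits(server_logs):
    """Yield (page, total_hits) from per-server logs sorted by page.

    Each log is an iterable of (page, hits) tuples; this shape is an
    assumption of the sketch.
    """
    merged = heapq.merge(*server_logs)  # lazy k-way merge, ordered by page
    for page, group in groupby(merged, key=lambda entry: entry[0]):
        yield page, sum(hits for _, hits in group)

server_a = [("/about", 3), ("/index", 10)]
server_b = [("/index", 7), ("/news", 2)]
totals = list(aggregate_hits([server_a, server_b]))
print(totals)
```

Because the listings are ordered, entries for the same page arrive next to each other in the merged stream, so the totals can be produced without holding any complete log in memory.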


Usage

The MuXML conceptual model is that of a filtering smart multiplexor. You hand MuXML a list of URLs. It accesses and parses them incrementally based on a blockSize that you specify. MuXML uses HTTP partial GET requests to allow flow control to be propagated to the resource servers. It will only get as much data from the server as it needs in order to generate a fragment. Once it has one or more fragments queued for a particular stream, it will only get more data once the application has consumed the current fragments.
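To make the block-at-a-time fetching concrete, here is a small Python sketch that computes successive byte ranges using the standard `bytes=start-end` form of the HTTP Range request header. It assumes that blockSize is a byte count and that the total size is known in advance; neither is confirmed for MuXML, and the function name is hypothetical.

```python
def range_headers(block_size, total_size):
    """Yield Range header values covering total_size bytes in blocks."""
    for start in range(0, total_size, block_size):
        # Byte ranges are inclusive at both ends.
        end = min(start + block_size, total_size) - 1
        yield f"bytes={start}-{end}"

headers = list(range_headers(block_size=4096, total_size=10000))
print(headers)
```

Since the function is a generator, the next range is only computed when the caller asks for it, which mirrors the idea of requesting more data only after the current fragments have been consumed.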

It filters out fragments of the incoming documents using a fragment recognizer that you give it. Whenever it has a complete fragment from any of the document streams that it is processing, it calls the fragment multiplexor that you have provided. Below are descriptions of these callbacks.

fragRecognizer

The fragRecognizer is invoked whenever a start tag is parsed AND we are not already inside a fragment for that stream. It returns a boolean value indicating whether this element should be made the root of a new fragment. It is passed the following parameters:

tag

The tag (GI) of the element.

attrs

An array of the attr/value pairs.
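A recognizer following this contract could be sketched as below, in Python for illustration only. The sketch assumes that attrs is a flat list of alternating names and values, and the `record` tag and `type` attribute are hypothetical.

```python
def frag_recognizer(tag, attrs):
    """Return True if this start tag should become a fragment root.

    attrs is assumed to be a flat list of alternating names and values,
    e.g. ["id", "42", "type", "hit"].
    """
    # Pair up names (even positions) with values (odd positions).
    pairs = dict(zip(attrs[0::2], attrs[1::2]))
    return tag == "record" and pairs.get("type") == "hit"

is_hit = frag_recognizer("record", ["id", "42", "type", "hit"])
is_root = frag_recognizer("log", [])
print(is_hit, is_root)
```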

fragMux

fragMux is called when the end tag of a recognized fragment in one of the document streams has been processed. MuXML and the fragMux communicate through arrays that contain an entry for each document stream in the stream set. For example, if you passed MuXML three URLs, the array would contain three entries.

fragMux is passed two arrays, one containing the state of the stream set on the previous call to fragMux and the other containing the current state. Each entry in the array(s) can have one of three values:

fragment

An XML::Grove::Element.

0

This indicates that there is no fragment available for this stream.

-1

This stream has reached EOF.

fragMux is passed the following parameters:

prevFrags

The fragments array (that was passed in as currFrags to the previous call to fragMux).

currFrags

The current fragments array. It will contain a single entry that is different from the corresponding entry in the prevFrags array. The application can consume zero or more of the fragments in the array. If it consumes a fragment, it must set the entry to 0.

The application may also choose to consume none of the fragments, for example if it requires all entries to be filled (i.e. a barrier). It is an error to consume no entries when all entries are filled, since MuXML would then stall.
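The consume-or-wait contract described above can be sketched as a barrier-style multiplexor. This is a Python illustration, not the MuXML Perl interface: fragments are stood in for by plain dictionaries with a hypothetical `key` field, and the `output` list is an assumption of the sketch.

```python
NO_FRAGMENT = 0   # no fragment available for this stream
EOF = -1          # this stream has reached EOF

def frag_mux(prev_frags, curr_frags, output):
    """Wait until no stream is pending, then consume the smallest fragment."""
    if any(entry == NO_FRAGMENT for entry in curr_frags):
        return  # barrier: some stream has not delivered a fragment yet
    live = [i for i, entry in enumerate(curr_frags) if entry != EOF]
    if not live:
        return  # every stream is finished
    winner = min(live, key=lambda i: curr_frags[i]["key"])
    output.append(curr_frags[winner])
    curr_frags[winner] = NO_FRAGMENT  # mark the entry as consumed

out = []
prev = [NO_FRAGMENT, NO_FRAGMENT, EOF]
curr = [{"key": "b"}, NO_FRAGMENT, EOF]
frag_mux(prev, curr, out)      # stream 1 is still pending: nothing consumed
after_first = list(out)

prev = list(curr)
curr[1] = {"key": "a"}
frag_mux(prev, curr, out)      # all live streams filled: consume the smallest
print(after_first, out, curr)
```

Whenever every live entry is filled, the sketch consumes exactly one fragment, so it never leaves a fully filled array untouched.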

fragError

fragError is called when an error occurs during processing. It is passed the following parameters:

error

An error string. For a parsing error, this is the error string from Expat.

fragIndex

The index of the document whose processing caused the error.


LWPng

The LWPng Perl module provides an event-driven framework for interacting with web-based resources. LWPng is the next generation of the venerable libwww-perl (version 5) distribution. Both LWPng and LWP are written by the same author, Gisle Aas.

LWPng provides programmatic access to the HTTP/1.1 protocol. Version 1.1 of HTTP has various features that enable more scalable interaction with web-based resources. MuXML doesn't make use of any of these HTTP/1.1 capabilities. Instead, it makes use of the event-driven framework that LWPng provides in conjunction with partial HTTP GET requests, which were already available in HTTP/1.0.

XML::Parser

The XML::Parser Perl module provides both a low-level and a higher-level interface to James Clark's C language parser, Expat. The underlying C parser provides an incremental parsing interface by default. Up until version 2.22, XML::Parser did not expose this interface. Version 2.22 was made available by Clark Cooper (the author of all but the first few versions of XML::Parser) on April 4th, 1999.
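To show what incremental parsing with Expat looks like in practice, here is a minimal sketch using Python's standard `xml.parsers.expat` binding rather than XML::Parser, so the calls differ from the Perl interface. The input is fed in arbitrary chunks that split tags in the middle, and the parser reports start tags once enough input has arrived. The chunk contents are made up for the example.

```python
import xml.parsers.expat

def parse_in_chunks(chunks):
    """Feed chunks to Expat one at a time and return the start tags seen."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    for chunk in chunks:
        parser.Parse(chunk, False)   # not final: more data may follow
    parser.Parse("", True)           # signal the end of the input
    return seen

tags = parse_in_chunks(["<log><hit pa", "ge='/'/><hit", " page='/a'/></log>"])
print(tags)
```

The caller decides when to hand over the next chunk, which is what lets a multiplexor interleave several documents in a single thread.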