Date: Sun, 24 Jun 2001 17:07:45 -0400
From: "Simon St.Laurent" <email@example.com>
To: firstname.lastname@example.org
Subject: ANN: Regular Fragmentations
Back in April I suggested that regular expressions might be a useful tool for fragmenting XML 'molecule' content into smaller pieces which could then be processed as 'atoms':
I've finally found the time to put together an implementation of this approach, building a SAX2 filter which uses an XML configuration file and the regular expression support built into the Xerces parser. As content passes through the filter, elements identified by the configuration file are processed and broken down into smaller elements using rules built on regular expressions.
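For readers who want a feel for the filtering approach, here is a minimal sketch of a SAX2 filter that breaks one element's text into child elements. It is not the package's code: the class name, the hard-coded `date` rule, and the `year`/`month`/`day` names are all hypothetical, the rule lives in the source rather than in an XML configuration file, and `java.util.regex` stands in for the Xerces regular expression package. It also assumes the target element holds text only.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class DateFragmenter extends XMLFilterImpl {
    // Hypothetical rule: split <date>YYYY-MM-DD</date> into three child elements.
    private static final Pattern RULE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
    private static final String[] NAMES = {"year", "month", "day"};
    private StringBuilder buffer; // non-null only while inside a <date> element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, local, qName, atts);
        if ("date".equals(local)) buffer = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (buffer != null) buffer.append(ch, start, length); // hold the text back
        else super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (buffer != null) { // assumes <date> contains text only
            String text = buffer.toString();
            buffer = null;
            Matcher m = RULE.matcher(text);
            if (m.matches()) {
                for (int i = 0; i < NAMES.length; i++) {
                    String value = m.group(i + 1);
                    super.startElement("", NAMES[i], NAMES[i], new AttributesImpl());
                    super.characters(value.toCharArray(), 0, value.length());
                    super.endElement("", NAMES[i], NAMES[i]);
                }
            } else {
                // No match: pass the original text through untouched.
                super.characters(text.toCharArray(), 0, text.length());
            }
        }
        super.endElement(uri, local, qName);
    }

    /** Runs the filter over an XML string; the serialization is crude (no escaping). */
    public static String run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final StringBuilder out = new StringBuilder();
        DateFragmenter filter = new DateFragmenter();
        filter.setParent(factory.newSAXParser().getXMLReader());
        filter.setContentHandler(new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int s, int n) {
                out.append(ch, s, n);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<entry><date>2001-06-24</date></entry>"));
    }
}
```

Run as written, this prints `<entry><date><year>2001</year><month>06</month><day>24</day></date></entry>`. Content that does not match the rule passes through unchanged.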
This filter is written in Java (1.3) and requires the Xerces parser. I've released it under the Mozilla Public License (MPL) and plan to continue developing it in the directions noted in the documentation. This release is version 0.02 and I don't make extensive claims for its stability, though it works quite well on the tests I've fed it.
The regular expression package in the Xerces parser is largely compliant with the regular expression language defined in Appendix F of XML Schema Part 2: Datatypes. (I'm still trying to determine how much this implementation differs from other regular expression approaches, but my experiments are only really getting started.) You can use the recursive feature built into the processor to perform multiple-level fragmentation if necessary.
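To illustrate what multiple-level fragmentation might look like, here is a rough sketch that works on plain strings rather than SAX events. The `RecursiveFragmenter` class, its `addRule` method, and the rule tables are hypothetical and are not the package's API; `java.util.regex` is used in place of the Xerces regular expression package.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RecursiveFragmenter {
    // Hypothetical rule table: element name -> delimiter regex for its content.
    private final Map<String, Pattern> delimiters = new LinkedHashMap<>();
    // Hypothetical rule table: element name -> names for the resulting pieces.
    private final Map<String, String[]> pieceNames = new LinkedHashMap<>();

    public void addRule(String element, String delimiter, String... names) {
        delimiters.put(element, Pattern.compile(delimiter));
        pieceNames.put(element, names);
    }

    /** Wraps content in the named element, fragmenting again wherever a rule applies. */
    public String fragment(String element, String content) {
        StringBuilder out = new StringBuilder();
        out.append('<').append(element).append('>');
        Pattern delimiter = delimiters.get(element);
        if (delimiter == null) {
            out.append(content); // no rule for this name: treat it as an 'atom'
        } else {
            String[] pieces = delimiter.split(content);
            String[] names = pieceNames.get(element);
            // Pieces beyond the available names are skipped in this sketch.
            for (int i = 0; i < pieces.length && i < names.length; i++) {
                out.append(fragment(names[i], pieces[i])); // recurse into each piece
            }
        }
        return out.append("</").append(element).append('>').toString();
    }

    public static void main(String[] args) {
        RecursiveFragmenter f = new RecursiveFragmenter();
        f.addRule("stamp", "T", "date", "time");
        f.addRule("date", "-", "year", "month", "day");
        f.addRule("time", ":", "hour", "minute", "second");
        System.out.println(f.fragment("stamp", "2001-06-24T17:07:45"));
    }
}
```

A rule whose pieces reuse the rule's own element name would recurse without end in this sketch, so a real implementation would need some guard against that.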
The "Regular Fragmentation" package is available from:
Documentation is still primarily javadoc, though an overview provides examples and some explanation. A list of planned improvements is at the end of the overview, and probably the most notable improvement planned is support for attribute content and content identification. Currently only element content is processed, and the rules only support identification through element names. (It is namespace-aware.)
Comments, suggestions, and contributions are welcome, either privately or to the xml-dev mailing list.
O'Reilly & Associates
Date: Mon, 02 Jul 2001 18:46:43 -0400
From: "Simon St.Laurent" <email@example.com>
To: firstname.lastname@example.org
Subject: update on Regular Fragmentations
The last week has been a good one for Regular Fragmentations:
There's still a long way to go - more on that later - but I've managed to implement the core functionality I wanted to provide.
Developers can specify regular expressions either for group-based matching ($0, $1, etc.) or for delimiter-based splitting. In both cases, the results are matched against a set of rules, each of which can optionally repeat if necessary. Recursive processing of results is possible as well.
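The difference between the two modes can be sketched in a few lines. Everything here is an assumption made for illustration: the `FragmentModes` class and its method names are hypothetical, `java.util.regex` replaces the Xerces regular expression package, and the choice to skip surplus results when no rule repeats is mine, not necessarily the package's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmentModes {
    /** Group mode: each capture group ($1, $2, ...) of a whole-content match is a result. */
    public static List<String> byGroups(String regex, String content) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(content);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) parts.add(m.group(i));
        }
        return parts;
    }

    /** Split mode: the regex describes the delimiter between results. */
    public static List<String> byDelimiter(String regex, String content) {
        return Arrays.asList(Pattern.compile(regex).split(content));
    }

    /** Pairs results with element names; the last name repeats if repeatLast is set. */
    public static String wrap(List<String> parts, String[] names, boolean repeatLast) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            // More results than names: stop unless the last name may repeat.
            if (i >= names.length && (!repeatLast || names.length == 0)) break;
            String name = names[Math.min(i, names.length - 1)];
            out.append('<').append(name).append('>')
               .append(parts.get(i))
               .append("</").append(name).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap(byGroups("(\\d+)x(\\d+)", "640x480"),
                new String[] {"width", "height"}, false));
        System.out.println(wrap(byDelimiter("\\s*,\\s*", "a, b ,c"),
                new String[] {"item"}, true));
    }
}
```

The first call prints `<width>640</width><height>480</height>`; the second prints `<item>a</item><item>b</item><item>c</item>`, with the single `item` name repeating for every result.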
There are some large steps yet to be taken before the code is really ready for the world, however. The most obvious problem is that I haven't yet supported attributes, either as targets or as results. Doing so is taking some substantial refactoring from a stream-oriented model to a partially tree-oriented model. The code remains pretty fragile right now: it can throw null pointer exceptions in cases such as a match producing more results than there are result rules.
It's improving but it will be a little while - it's only at version 0.07, after all. For right now, it's probably wise for it to remain a one-programmer endeavor. As things clean up, it should be easier for people to make contributions. I'm hoping to set up some unit testing options as well, though I'm still figuring that out.
If you want to get some idea what this looks like, there are some examples in the javadoc overview at:
In any case, I just wanted to let people know that the work is happening and that simple things can prove pretty powerful.
Prepared by Robin Cover for The XML Cover Pages archive.