Simon St.Laurent (O'Reilly & Associates) has released a Java SAX filter called 'Regular Fragmentations', which uses regular expressions to fragment content into XML elements. "Regular fragmentations are an approach to processing textual content as if it had been represented as more finely-grained markup. The XML Schema Datatypes specification, for instance, offers a number of lexically compound types among its primitive types, requiring developers to rely on extension functions or XML Schema processing to manipulate them with XSLT. Regular fragmentations allow developers to specify the application of regular expressions to element content (attribute content coming soon!) using an XML-based rules syntax. An open source SAXFilter implementation allows the use of regular fragmentations in a wide variety of XML processing environments... XML developers are constantly faced with questions about how fine-grained their data structures should be, and the difficult problem of dealing with cases where other people chose coarse-grained structures. While tools like XSLT can do an excellent job retrieving needles from haystacks, it's much easier to extract needles that are labelled and cleanly separated from the surrounding content. The com.simonstl.fragment package allows developers to specify rules using regular expressions which are applied to element content during the parsing process. While the document is parsed, those rules are applied to the textual content of the specified elements and new child elements are created, adding extra markup information to the document."
"Regular Fragmentations: Treating Complex Textual Content as Markup." By Simon St.Laurent (O'Reilly & Associates). Paper to be presented at Extreme Markup Languages 2001, August 12-17, 2001, Montréal, Canada. "Regular fragmentations are an approach to processing textual content as if it had been represented as more finely-grained markup. The XML Schema Datatypes specification, for instance, offers a number of lexically compound types among its primitive types, requiring developers to rely on extension functions or XML Schema processing to manipulate them with XSLT. Regular fragmentations allow developers to specify the application of regular expressions to element content (attribute content coming soon!) using an XML-based rules syntax. An open source SAXFilter implementation allows the use of regular fragmentations in a wide variety of XML processing environments."
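The general idea of fragmenting element content during parsing can be illustrated with a small SAX filter. The following is only a minimal sketch and not the com.simonstl.fragment API: the class name FragmentSketch, the fragment helper, and the hard-coded single rule (in place of the XML-based rules syntax) are all hypothetical names chosen for illustration. It assumes the target element contains text only and that every capture group participates in the match.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical illustration: splits the text of one element into child
// elements, one per regular-expression capture group.
public class FragmentSketch extends XMLFilterImpl {
    private final String target;        // qualified name of the element to fragment
    private final Pattern pattern;      // rule applied to the element's text
    private final String[] childNames;  // child element name for each capture group
    private StringBuilder buffer;       // non-null only while inside the target element

    public FragmentSketch(XMLReader parent, String target, String regex, String... childNames) {
        super(parent);
        this.target = target;
        this.pattern = Pattern.compile(regex);
        this.childNames = childNames;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, local, qName, atts);
        if (qName.equals(target)) {
            buffer = new StringBuilder();   // start collecting text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (buffer != null) {
            buffer.append(ch, start, length);      // hold text back until the element ends
        } else {
            super.characters(ch, start, length);   // pass other text through unchanged
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (buffer != null && qName.equals(target)) {
            String text = buffer.toString();
            buffer = null;
            Matcher m = pattern.matcher(text);
            if (m.matches() && m.groupCount() >= childNames.length) {
                // Emit one synthetic child element per capture group.
                for (int i = 0; i < childNames.length; i++) {
                    String value = m.group(i + 1);
                    super.startElement("", childNames[i], childNames[i], new AttributesImpl());
                    super.characters(value.toCharArray(), 0, value.length());
                    super.endElement("", childNames[i], childNames[i]);
                }
            } else {
                // No match: forward the original text untouched.
                super.characters(text.toCharArray(), 0, text.length());
            }
        }
        super.endElement(uri, local, qName);
    }

    // Runs the filter over an XML string and returns a bare-bones serialization
    // (element names and text only; attributes and escaping are ignored).
    public static String fragment(String xml, String target, String regex, String... names)
            throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        FragmentSketch filter = new FragmentSketch(parser, target, regex, names);
        StringBuilder out = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override
            public void characters(char[] ch, int s, int len) {
                out.append(ch, s, len);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // A lexically compound date value is split into year, month and day children.
        System.out.println(fragment("<event><date>2001-08-12</date></event>",
                "date", "(\\d{4})-(\\d{2})-(\\d{2})", "year", "month", "day"));
    }
}
```

Because the filter sits between the parser and the downstream handler, later processing such as XSLT would see the year, month and day children as if they had been present in the source document. Text that does not match the rule is passed through unchanged.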
Principal references:
- Project Announcement [June 24, 2001 and July 02, 2001]
- Regular Fragmentations web site
- Documentation
- Atoms/Molecules (background)
- simonstl.com