A research team at the University of Washington Department of Computer Science and Engineering has developed an XML toolkit using a highly scalable XML stream processor and an XML stream index. The XML toolkit "consists of a number of utilities that perform simple operations on XML files. [The team has built] a sort utility, an aggregation utility, a mapping utility to transform Unix directory hierarchies to XML, and some smaller utilities. The utilities can be combined sequentially, in pipes, to perform more complex XML processing. The toolkit defines a binary XML format that is more compact than XML (by a factor of two, on average). This format can optionally be used as the communication format in a pipeline, to achieve speedups by a factor of two. The toolkit defines a novel index called the stream index, SIX. All utilities check whether the input data has a pre-computed SIX, and use it if it is available... From a technical point of view, it is essential for XML Toolkit commands to parse XML data and process a collection of XPath expressions. As a common library for these commands, the team has implemented two technologies to realize high-throughput XML data processing: (1) a lazy DFA-based XPath processor, a deterministic finite automaton that is constructed lazily, and (2) the 'SIX' streaming index for XML data, used by the XML parser. They have implemented the Lazy DFA and SIX on a tokenized (binary) SAX parser and the xerces1_4_0 SAX parser." The XML Toolkit library is available for download.
"There are two important technical contributions in the toolkit: a highly scalable XML stream processor, and an XML stream index. Both have important applications beyond the toolkit, to XML packet routing and to selective dissemination of information. The XML stream processor achieves a sustained throughput of about 5.6MB/s and scales up to large numbers of XPath expressions (up to 1,000,000 in our experiments). Since the processor transforms all XPath expressions into a single deterministic automaton, we did a detailed theoretical analysis to prove that the number of states in the automaton remains relatively small. The stream index is the first attempt to index streaming data, rather than stored data. It has to be small compared to the data stream (ours is 6% or less), and has to arrive just in time with the data stream. We measured speedups of up to a factor of four in processing XML streams with the index..." [from the website]
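The "single deterministic automaton" idea can be sketched for linear XPath expressions built from child (/) and descendant (//) steps; the actual processor covers far more of XPath. All expressions are compiled into one NFA, and each DFA state, a set of NFA states, is materialized only when the stream first needs it, which is what keeps the automaton small in practice. A minimal sketch under those assumptions:

```python
# Sketch of a lazily-determinized automaton over many linear XPath
# expressions. Assumption: expressions use only '/' and '//' with plain
# element names; this is an illustration, not the XMLTK implementation.
import re

def parse(expr):
    """Parse a linear XPath like '/a//b/c' into (axis, tag) steps."""
    return [('desc' if axis == '//' else 'child', tag)
            for axis, tag in re.findall(r'(//|/)(\w+)', expr)]

class LazyDFA:
    def __init__(self, exprs):
        self.paths = [parse(e) for e in exprs]
        # A DFA state is a frozenset of NFA states (expr_id, position).
        self.start = frozenset((i, 0) for i in range(len(exprs)))
        self.table = {}                      # (dfa_state, tag) -> dfa_state

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:            # materialize transition lazily
            nxt = set()
            for e, i in state:
                steps = self.paths[e]
                if i < len(steps):
                    axis, t = steps[i]
                    if t == tag:
                        nxt.add((e, i + 1))  # step matched: advance
                    if axis == 'desc':
                        nxt.add((e, i))      # '//' may skip this element
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def matches(self, state):
        """IDs of expressions fully consumed in this state."""
        return {e for e, i in state if i == len(self.paths[e])}

dfa = LazyDFA(['/doc//name', '/doc/id'])
state = dfa.start
for tag in ['doc', 'person', 'name']:        # one root-to-leaf element path
    state = dfa.step(state, tag)
# dfa.matches(state) now contains 0, the index of '/doc//name'
```

Evaluating a full document rather than a single root-to-leaf path uses a stack of DFA states: push the successor state on each start-element event and pop on each end-element event. Only transitions the stream actually exercises are ever computed and cached.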
From the PLAN-X Workshop paper: "We describe a toolkit for highly scalable XML data processing, consisting of two components. The first is a collection of stand-alone XML tools, such as sorting, aggregation, nesting, and unnesting, that can be chained to express more complex restructurings. The second is a highly scalable XPath processor for XML streams that can be used to develop scalable solutions for XML stream applications. In this paper we discuss the tools, and some of the techniques we used to achieve high scalability. The toolkit is freely available as an open-source project. Each of the stand-alone XML tools performs one single kind of transformation, but can scale to arbitrarily large XML documents in, essentially, linear time, and using only a moderate amount of main memory. There is a need for such tools in user communities that have traditionally processed data formatted in line-oriented text files, such as network traffic logs, web server logs, telephone call records, and biological data. Today, many of these applications are done by combinations of Unix commands, such as grep, sed, sort, and awk. All these data formats can and should be translated into XML, but then all the line-oriented Unix commands become useless. Our goal is to provide tools that can process the data after it has been migrated to XML. Our second goal is to study highly efficient XML stream processing techniques. The problem in XML stream processing is the following: we are given a large number of boolean XPath expressions and a continuous stream of XML documents and have to decide, for each document, which of the XPath expressions it satisfies. In stream applications like publish/subscribe or XML packet routing this evaluation needs to be done at a speed comparable with the network throughput, and scale to large numbers of XPath expressions...
We report here one novel technique for stream XML processing called Stream IndeX, SIX, and describe its usage in conjunction with the stand-alone tools. A SIX for an XML file (or XML stream) consists of a sequence of byte offsets in the XML file that can be used by the XPath processor to skip unneeded portions. When used in applications like XML packet routing, the SIX needs to be computed only once for each packet, which can be done when the XML packet is first generated, then routed together with the packet... The work closest to our toolkit is LT XML. It defines a C-based API for processing XML files, and builds a large number of tools using this API. Their emphasis is on completeness, rather than scalability: there is a rich set of tools for searching and transforming XML files, including a small query processor..."
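The role of the byte offsets can be sketched with Python's expat parser, whose CurrentByteIndex reports where the current tag begins in the input. The function below builds a simplified, in-memory offset table; it is not the actual SIX format, which is compact (6% or less of the data) and shipped just in time with the stream, as described in the VLDB paper.

```python
# Sketch of a SIX-like byte-offset index: for each element, record where
# its start tag begins and where its end tag begins, so a matcher that
# rules out a subtree can seek past it instead of parsing it.
# Simplification: a real SIX is far more compact and arrives with the stream.
import xml.parsers.expat

def build_six(xml_bytes):
    """Return [(tag, start_offset, end_tag_offset), ...] in document order."""
    parser = xml.parsers.expat.ParserCreate()
    index, stack = [], []

    def start(name, attrs):
        stack.append((name, parser.CurrentByteIndex, len(index)))
        index.append(None)                       # reserve slot in doc order

    def end(name):
        tag, begin, slot = stack.pop()
        index[slot] = (tag, begin, parser.CurrentByteIndex)

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)
    return index

doc = b'<a><b>skip me</b><c>keep</c></a>'
six = build_six(doc)
# If no XPath expression can match inside <b>, the matcher seeks from the
# entry for 'b' straight to its end-tag offset, skipping the whole subtree.
```

In the packet-routing setting this table is computed once, when the XML packet is generated, and travels with the packet, so every router on the path gets the skipping benefit without re-scanning the data.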
Principal references:
- XML Toolkit for Light-weight XML Stream Processing. Website presentation.
- xmltk on SourceForge, and the SourceForge project page
- XML Toolkit download
- XML-TK Binary format specification
- "XML Stream Processing." By Dan Suciu. 25 slides. April, 2002. "The problem: Given [1] a large number of XPath expressions and [2] an incoming stream of XML documents, decide for each document which expressions it matches..." [cache, PDF]
- Tutorial: XML Toolkit. By Makoto Onizuka. Draft only, V1.0. August 12, 2002. "The XML toolkit is a framework for light-weight, high-performance XML data stream processing. It comprises fundamental libraries (XML SAX parser, XPath processor) and a collection of simple XML processing commands (corresponding to plain-text processing commands like cat, head, tail, sort, grep, and so forth) built upon those libraries. Section 2 explains the XML toolkit commands and Section 3 explains the XML toolkit libraries..."
- "XMLTK: An XML Toolkit for Scalable XML Stream Processing." Draft version 2002-07. 13 pages. By Iliana Avila-Campillo (Institute for Systems Biology), Todd Green (Xyleme), Ashish Gupta (University of Washington), Makoto Onizuka (NTT Cyber Space Laboratories, NTT Corporation), Demian Raven (University of Washington), and Dan Suciu (University of Washington). Paper prepared for presentation at the PLAN-X Workshop on Programming Language Technologies for XML, October 3, 2002, Pittsburgh, PA, USA.
- "Processing XML Streams with Deterministic Automata and Stream Indices." By Gerome Miklau, Todd J. Green, Makoto Onizuka, and Dan Suciu. Abstract for the VLDB 2002 paper.
- XML Toolkit mailing list