The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: August 26, 2002.
News: Cover Stories

Stream Index 'SIX' Used in XML Stream Processing Toolkit.

A research team at the University of Washington Department of Computer Science and Engineering has developed an XML toolkit built around a highly scalable XML stream processor and an XML stream index. The XML toolkit "consists of a number of utilities that perform simple operations on XML files. [The team has built] a sort utility, an aggregation utility, a mapping utility to transform Unix directory hierarchies to XML, and some smaller utilities. The utilities can be combined sequentially, in pipes, to perform more complex XML processing. The toolkit defines a binary XML format that is more compact than XML (by a factor of two, on average). This format can optionally be used as a communication format in a pipeline, to achieve speedups by a factor of two. The toolkit defines a novel index called the stream index, SIX. All utilities check whether the input data has a pre-computed SIX, and use it if it is available... From a technical point of view, it is essential for XML Toolkit commands to parse XML data and process a collection of XPath expressions. As a common library for these commands, the team has implemented two technologies to achieve high-throughput XML data processing: (1) a Lazy DFA-based XPath processor, a deterministic finite automaton that is constructed lazily, and (2) the 'SIX' streaming index for XML data, used by the XML parser. They have implemented the Lazy DFA and SIX on a tokenized (binary) SAX parser and the xerces1_4_0 SAX parser." The XML Toolkit library is available for download.
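The streaming, pipe-friendly character of such utilities can be illustrated with a minimal SAX-based sketch. This is illustrative Python, not the toolkit's code, and the names `PathCounter` and `count_path` are hypothetical; the point is that a single forward pass with a tag stack processes arbitrarily large documents in linear time, with memory bounded by document depth rather than document size.

```python
import io
import xml.sax

class PathCounter(xml.sax.ContentHandler):
    """Count elements whose root-to-element tag path equals `target`.

    One streaming pass; memory use is bounded by document depth,
    never by document size.
    """
    def __init__(self, target):
        super().__init__()
        self.target = tuple(target)
        self.stack = []   # current open-element path
        self.count = 0

    def startElement(self, name, attrs):
        self.stack.append(name)
        if tuple(self.stack) == self.target:
            self.count += 1

    def endElement(self, name):
        self.stack.pop()

def count_path(xml_bytes, target):
    handler = PathCounter(target)
    xml.sax.parse(io.BytesIO(xml_bytes), handler)
    return handler.count

doc = b"<log><entry/><entry/><other><entry/></other></log>"
print(count_path(doc, ["log", "entry"]))  # prints 2
```

Because the handler never buffers the document, such a pass composes naturally with others in a Unix-style pipeline, which is the design the toolkit exploits.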

"There are two important technical contributions in the toolkit: a highly scalable XML stream processor, and an XML stream index. Both have important applications beyond the toolkit, to XML packet routing and to selective dissemination of information. The XML stream processor achieves a sustained throughput of about 5.6MB/s and scales up to large numbers of XPath expressions (up to 1,000,000 in our experiments). Since the processor transforms all XPath expressions into a single deterministic automaton, we did a detailed theoretical analysis to prove that the number of states in the automaton remains relatively small. The stream index is the first attempt to index streaming data, rather than stored data. It has to be small compared to the data stream (ours is 6% or less), and has to arrive just in time with the data stream. We measured speedups of up to a factor of four in processing XML streams with the index..." [from the website]
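The lazy-DFA idea is to determinize the combined automaton of all XPath expressions only for the states the input actually reaches, so the exponential worst case of eager determinization is never paid for typical data. The following is a hedged Python toy, not the team's implementation: the names `LazyDFA` and `match_document` are mine, and it supports only absolute child-axis paths with a `*` wildcard, a much smaller fragment than the real processor handles.

```python
class LazyDFA:
    """Lazily determinized automaton over element-open events for a set
    of child-axis patterns like "a/b/c" ("*" matches any tag).

    An NFA state is (pattern_id, step_index); a DFA state is a frozenset
    of NFA states. DFA transitions are built on first use and memoized,
    so only states reachable on the actual input are ever materialized.
    """
    def __init__(self, patterns):
        self.steps = [p.strip("/").split("/") for p in patterns]
        self.start = frozenset((pid, 0) for pid in range(len(self.steps)))
        self.trans = {}   # (dfa_state, tag) -> dfa_state, filled lazily

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.trans:          # construct this DFA edge lazily
            nxt = set()
            for pid, i in state:
                s = self.steps[pid]
                if i < len(s) and s[i] in (tag, "*"):
                    nxt.add((pid, i + 1))
            self.trans[key] = frozenset(nxt)
        return self.trans[key]

    def matched(self, state):
        """IDs of patterns fully consumed in this DFA state."""
        return {pid for pid, i in state if i == len(self.steps[pid])}

def match_document(patterns, events):
    """events: ("start", tag) / ("end", tag) pairs from a SAX-like scan.
    Returns the set of pattern IDs the document satisfies."""
    dfa = LazyDFA(patterns)
    stack = [dfa.start]     # DFA state per open element; pop on close
    hits = set()
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            hits |= dfa.matched(stack[-1])
        else:
            stack.pop()
    return hits

events = [("start", "a"), ("start", "b"), ("end", "b"),
          ("start", "c"), ("end", "c"), ("end", "a")]
print(match_document(["a/b", "a/x"], events))  # prints {0}
```

The stack of DFA states replaces backtracking: each element-open is a single hash lookup regardless of how many patterns are registered, which is what lets this style of processor scale to very large XPath workloads.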

From the PLAN-X Workshop paper: "We describe a toolkit for highly scalable XML data processing, consisting of two components. The first is a collection of stand-alone XML tools, such as sorting, aggregation, nesting, and unnesting, that can be chained to express more complex restructurings. The second is a highly scalable XPath processor for XML streams that can be used to develop scalable solutions for XML stream applications. In this paper we discuss the tools, and some of the techniques we used to achieve high scalability. The toolkit is freely available as an open-source project. Each of the stand-alone XML tools performs one single kind of transformation, but can scale to arbitrarily large XML documents in, essentially, linear time, and using only a moderate amount of main memory. There is a need for such tools in user communities that have traditionally processed data formatted in line-oriented text files, such as network traffic logs, web server logs, telephone call records, and biological data. Today, many of these applications are done by combinations of Unix commands, such as grep, sed, sort, and awk. All these data formats can and should be translated into XML, but then all the line-oriented Unix commands become useless. Our goal is to provide tools that can process the data after it has been migrated to XML. Our second goal is to study highly efficient XML stream processing techniques. The problem in XML stream processing is the following: we are given a large number of boolean XPath expressions and a continuous stream of XML documents and have to decide, for each document, which of the XPath expressions it satisfies. In stream applications like publish/subscribe or XML packet routing this evaluation needs to be done at a speed comparable with the network throughput, and scale to large numbers of XPath expressions... 
We report here one novel technique for stream XML processing called Stream IndeX, SIX, and describe its usage in conjunction with the stand-alone tools. A SIX for an XML file (or XML stream) consists of a sequence of byte offsets in the XML file that can be used by the XPath processor to skip unneeded portions. When used in applications like XML packet routing, the SIX needs to be computed only once for each packet, which can be done when the XML packet is first generated, then routed together with the packet... The work closest to our toolkit is LT XML. It defines a C-based API for processing XML files, and builds a large number of tools using this API. Their emphasis is on completeness, rather than scalability: there is a rich set of tools for searching and transforming XML files, including a small query processor..."
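The SIX idea — a compact side-table of byte offsets that lets a processor seek past subtrees it does not need — can be sketched offline as follows. This is an illustrative Python toy under loud assumptions: `build_six` and `skip_to` are hypothetical names, the crude regex tokenizer ignores comments, CDATA, and processing instructions, and the real SIX is computed in streaming fashion and shipped just in time with the data, not built after the fact.

```python
import re

# Matches an open tag, close tag, or self-closing tag and captures
# (closing-slash, tag-name, self-closing-slash). Toy tokenizer only:
# no comments, CDATA, or processing instructions.
TAG = re.compile(rb"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>")

def build_six(xml_bytes):
    """Toy SIX: one (tag, start_offset, end_offset) entry per element."""
    index, stack = [], []
    for m in TAG.finditer(xml_bytes):
        closing, tag, selfclose = m.group(1), m.group(2), m.group(3)
        if closing:                      # </tag>: close deepest open element
            t, start = stack.pop()
            index.append((t, start, m.end()))
        elif selfclose:                  # <tag/>: element is its own span
            index.append((tag, m.start(), m.end()))
        else:                            # <tag>: remember where it opened
            stack.append((tag, m.start()))
    return index

def skip_to(index, wanted, xml_bytes):
    """Jump straight to subtrees rooted at `wanted`, never parsing
    the bytes that the index says can be skipped."""
    return [xml_bytes[s:e] for t, s, e in index if t == wanted]

doc = b"<r><skip><deep/></skip><hit>x</hit></r>"
print(skip_to(build_six(doc), b"hit", doc))  # prints [b'<hit>x</hit>']
```

The payoff in the paper's routing scenario is that the offsets are computed once, when the packet is generated, and every downstream router can then leap over irrelevant subtrees instead of tokenizing them — the source of the reported up-to-4x speedups.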


Hosted By
OASIS - Organization for the Advancement of Structured Information Standards


Robin Cover, Editor: