The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: May 10, 2001
Tukwila Data Integration System (University of Washington)

[May 10, 2001] The University of Washington Database Research Group is developing the Tukwila system, which "uses adaptive query processing techniques to efficiently deal with processing heterogeneous, XML-based data from across the Internet. The data integration system depends upon a mediated schema to represent a particular application domain, and data sources are mapped as views over the mediated schema. The user poses a query over the mediated schema, and the data integration system reformulates it into a query over the data sources and executes it. The system then intelligently processes the query, reading data across the network and responding to data source sizes, network conditions, and other factors.

The Tukwila data integration system is designed to scale up to the amounts of data transmissible across intranets and the Internet (tens to hundreds of MBs), with large numbers of data sources. It is designed to support adaptivity at its core using a two-pronged approach. A highly efficient query reformulation algorithm, MiniCon, maps the input query from the mediated schema to the data sources. Next, interleaved planning and execution with partial optimization allow Tukwila to process the reformulated plan, quickly recovering if decisions were based on inaccurate estimates.

The system provides integrated support for efficient processing of XML data, based on the x-scan operator. X-scan efficiently processes non-materialized XML data as it is being received by the data integration system; it matches regular path expression patterns from the query, returning results in pipelined fashion as the data streams across the network. XML provides a common encoding for data from many different sources; combined with standardization of schemas (DTDs) across certain domains, it greatly reduces the need for wrappers and even query reformulation. The latest versions of Tukwila are built around an adaptive query processing architecture for XML, and can seamlessly combine XML and relational data into new XML content."
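The mediated-schema pattern described above can be sketched minimally in Python. This is an illustrative toy, not Tukwila's actual interface; the relation name `book` and the source names are hypothetical:

```python
# Toy sketch of data sources registered as views over a mediated schema.
# All names here (book, amazon, powells) are hypothetical examples.

# Each source advertises which mediated relation its view covers.
SOURCE_VIEWS = {
    "amazon":  {"covers": "book", "attrs": ["title", "price"]},
    "powells": {"covers": "book", "attrs": ["title", "price"]},
}

def reformulate(query_relation):
    """Map a query over the mediated schema onto the sources
    whose views cover the queried relation."""
    return [name for name, view in SOURCE_VIEWS.items()
            if view["covers"] == query_relation]

# A single query over the mediated schema fans out to both sources.
print(reformulate("book"))  # -> ['amazon', 'powells']
```

In the real system the reformulator (MiniCon) handles conjunctive queries and variable mappings; this sketch only captures the view-registration idea that lets new sources be added without changing user queries.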

"X-Scan: a Foundation for XML Data Integration. Most existing XML query processing systems convert XML documents to an internal representation, generally a set of tables or objects; path expressions are evaluated using either index structures or join operations across the tables or objects. Unfortunately, the required index creation or join operations are often costly even with locally stored data, and they are especially expensive in the data integration domain, where the system reads data streamed from remote sources across a network, and seldom reuses results for subsequent queries. We propose the x-scan operator, which efficiently processes non-materialized XML data as it is being received by the data integration system...Building from x-scan, we have constructed a new version of the Tukwila Data Integration System that efficiently integrates streaming XML and relational data, combining the results seamlessly into new XML content."

"The MiniCon Algorithm is a new algorithm for answering queries using views that is used as the query reformulator for the data integration system Tukwila...Representing the data sources as views over the mediated schema enables new data sources to be added with a minimum of human intervention, but the problem of translating the query into queries over the data sources (query reformulation) is NP-Complete for even conjunctive queries. Our goal in this project was to find a scalable algorithm for query reformulation (which is equivalent to the problem of answering queries using views) and to show experimentally that these algorithms were feasible for even a large number of data sources."

"The Tukwila data integration system introduces a number of new techniques for query reformulation, optimization, and execution. Query processing in data integration occurs over network-bound, autonomous data sources ranging from conventional databases on the LAN or intranet to web-based sources across the Internet. [High volume data] requires extensions to traditional optimization and execution techniques for three reasons: there is an absence of quality statistics about the data, data transfer rates are unpredictable and bursty, and slow or unavailable data sources can often be replaced by overlapping or mirrored sources; additional challenges are posed when we wish to integrate XML data... During execution, Tukwila uses adaptive query operators such as the double pipelined hash join, which produces answers quickly, and the dynamic collector, which robustly and efficiently computes unions across overlapping data sources... The Tukwila query processing components are designed to be self-contained modules that can be swapped out as needed. Each of the main components (reformulator, optimizer, execution engine, and wrappers) are separate code modules, each optionally in a different language and on a different platform. A sockets-based communication interface with a standardized request model allow us to interchange parts."

References:

  • Tukwila project

  • Overview

  • MiniCon algorithm for efficient query reformulation

  • X-scan operator for XML data integration

  • The Tukwila Execution Interface

  • University of Washington Database Research Group

  • [August 30, 2000] "Efficient Evaluation of Regular Path Expressions on Streaming XML Data." By Zachary G. Ives, Alon Y. Levy, and Daniel S. Weld (University of Washington Database Group). Technical Report UW-CSE-2000-05-02, University of Washington, 2000. Submitted for publication. 22 pages (with 18 references). "The adoption of XML promises to accelerate construction of systems that integrate distributed, heterogeneous data. Query languages for XML are typically based on regular path expressions that traverse the logical XML graph structure; the efficient evaluation of such path expressions is central to good query processing performance. Most existing XML query processing systems convert XML documents to an internal representation, generally a set of tables or objects; path expressions are evaluated using either index structures or join operations across the tables or objects. Unfortunately, the required index creation or join operations are often costly even with locally stored data, and they are especially expensive in the data integration domain, where the system reads data streamed from remote sources across a network, and seldom reuses results for subsequent queries. This paper presents the x-scan operator which efficiently processes non-materialized XML data as it is being received by the data integration system. X-scan matches regular path expression patterns from the query, returning results in pipelined fashion as the data streams across the network. We experimentally demonstrate the benefits of the x-scan operator versus the approaches used in current systems, and we analyze the algorithm's performance and scalability across a range of XML document types and queries. [Conclusions:] In this paper we have presented the x-scan algorithm, a new primitive for XML query processing, that evaluates regular path expressions to produce bindings.
X-scan is scalable to larger XML documents than previous approaches and provides important advantages for data integration, with the following contributions: (1) X-scan is pipelined and produces bindings as data is being streamed into the system, rather than requiring an initial stage to store and index the data. (2) X-scan handles graph-structured data, including cyclical data, by resolving and traversing IDREF edges, and it does this following document order and eliminating duplicate bindings. (3) X-scan generates an index of the structure of the XML document, while preserving the original XML structure. (4) X-scan uses a set of dependent finite state machines to efficiently compute variable bindings as edges are traversed. In contrast to semi-structured indexing techniques, x-scan constructs finite automata for the paths in the query, rather than for the paths in the data. (5) X-scan is very efficient, typically imposing only an 8% overhead on top of the time required to parse the XML document. X-scan scales to handle large XML sources and compares favorably to Lore and a commercial XML repository, sometimes even when the cost of loading data into those systems is ignored." Note from ZI home page: "Zack works with Professors Alon Levy and Dan Weld on the Tukwila data integration system. Tukwila uses adaptive query processing techniques to efficiently deal with processing heterogeneous, XML-based data from across the Internet. Current research in Tukwila is in new adaptive query processing techniques for streaming XML data, as well as policies for governing the use of adaptive techniques." For related references, see "XML and Query Languages." [cache]

  • [August 30, 2000] "X-Scan: a Foundation for XML Data Integration." Project overview. From the University of Washington Database Group. "The x-scan algorithm is a new operator designed to facilitate integration of XML data sources in the context of the Tukwila data integration system." "The adoption of XML promises to accelerate construction of systems that integrate distributed, heterogeneous data. Query languages for XML are typically based on regular path expressions that traverse the logical XML graph structure; the efficient evaluation of such path expressions is central to good query processing performance. Most existing XML query processing systems convert XML documents to an internal representation, generally a set of tables or objects; path expressions are evaluated using either index structures or join operations across the tables or objects. Unfortunately, the required index creation or join operations are often costly even with locally stored data, and they are especially expensive in the data integration domain, where the system reads data streamed from remote sources across a network, and seldom reuses results for subsequent queries. We propose the x-scan operator, which efficiently processes non-materialized XML data as it is being received by the data integration system. X-scan matches regular path expression patterns from the query, returning results in pipelined fashion as the data streams across the network. We have experimentally demonstrated the benefits of the x-scan operator versus the approaches used in current systems and analyzed the algorithm's performance and scalability across a range of XML document types and queries. [...] X-scan is a new method for evaluating path expressions as data is streaming into the system. The input to x-scan is an XML data stream and a set of regular path expressions occurring in a query; x-scan's output is a stream of bindings for the variables occurring in the expressions.
A key feature of x-scan is that it produces these bindings incrementally, as the XML data is streaming in; hence, x-scan fits naturally as the source operator to a complex pipeline, and it is highly suited for data integration applications. X-scan is motivated by the observation that IDREF links are limited to the scope of the current document, so in principle, the entire XML query graph for a document could be constructed in a single pass. X-scan achieves this by simultaneously parsing the XML data, indexing nodes by their IDs, resolving IDREFs, and returning the nodes that match the path expressions of the query. In addition to the path expression evaluation routines, x-scan includes the following functionality: (1) Parsing the XML document; (2) Node ID recording and reference resolving; (3) Creating a graph-structured index of the file; (4) Returning tuples of node locations..."

  • "A Scalable Algorithm for Answering Queries Using Views." By Rachel Pottinger and Alon Levy. Paper presented at VLDB 2000, Cairo, Egypt, 2000. "The problem of answering queries using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database relations. The problem has received significant attention because of its relevance to a wide variety of data management problems, such as data integration, query optimization, and the maintenance of physical data independence. To date, the performance of proposed algorithms has received very little attention, and in particular, their scale up in the presence of a large number of views is unknown. We first analyze two previous algorithms, the bucket algorithm and the inverse-rules algorithm, and show their deficiencies. We then describe the MiniCon algorithm, a novel algorithm for finding the maximally-contained rewriting of a conjunctive query using a set of conjunctive views. We present the first experimental study of algorithms for answering queries using views. The study shows that the MiniCon algorithm scales up well and significantly outperforms the previous algorithms. Finally, we describe an extension of the MiniCon algorithm to handle comparison predicates, and show its performance experimentally."


Document URI: http://xml.coverpages.org/tukwila.html
Robin Cover, Editor: robin@oasis-open.org