The University of Washington Database Research Group is developing Tukwila, a system that "uses adaptive query processing techniques to efficiently process heterogeneous, XML-based data from across the Internet. The data integration system depends upon a mediated schema to represent a particular application domain, and data sources are mapped as views over the mediated schema. The user poses a query over the mediated schema, and the data integration system reformulates it into a query over the data sources and executes it. The system then intelligently processes the query, reading data across the network and responding to data source sizes, network conditions, and other factors.

The Tukwila data integration system is designed to scale up to the amounts of data transmissible across intranets and the Internet (tens to hundreds of MBs), with large numbers of data sources. It supports adaptivity at its core using a two-pronged approach. First, a highly efficient query reformulation algorithm, MiniCon, maps the input query from the mediated schema to the data sources. Next, interleaved planning and execution with partial optimization allow Tukwila to process the reformulated plan, quickly recovering if decisions were based on inaccurate estimates.

The system provides integrated support for efficient processing of XML data, based on the x-scan operator. X-scan efficiently processes non-materialized XML data as it is being received by the data integration system; it matches regular path expression patterns from the query, returning results in pipelined fashion as the data streams across the network. XML provides a common encoding for data from many different sources; combined with standardization of schemas (DTDs) across certain domains, it greatly reduces the need for wrappers and even query reformulation.
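To make the pipelined behavior of an x-scan-style operator concrete, here is a minimal sketch in Python. It is a deliberate simplification: it handles only child-axis paths such as `db/book/title`, not the full regular path expressions x-scan supports, and the element names and data are invented for illustration.

```python
# Sketch of an x-scan-style operator: stream-parse XML and yield bindings
# for a (simplified, child-axis-only) path pattern as each matching element
# completes, without materializing the whole document first.
import xml.etree.ElementTree as ET
from io import StringIO

def x_scan(stream, path):
    """Yield text bindings for elements matching `path` (e.g. "db/book/title")
    as soon as each closing tag arrives on the stream."""
    steps = path.split("/")
    stack = []
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == steps:
                yield elem.text          # pipelined: emit before parse ends
            stack.pop()
            elem.clear()                 # free memory as data streams past

# Hypothetical document standing in for data arriving over the network.
xml_data = StringIO(
    "<db><book><title>Tukwila</title></book>"
    "<book><title>MiniCon</title></book></db>"
)
print(list(x_scan(xml_data, "db/book/title")))  # → ['Tukwila', 'MiniCon']
```

Because `x_scan` is a generator over a streaming parse, a consumer (e.g. a join operator) can begin work on the first binding while the rest of the document is still in transit.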
The latest versions of Tukwila are built around an adaptive query processing architecture for XML, and can seamlessly combine XML and relational data into new XML content."
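The reformulation step described above, in which sources mapped as views over the mediated schema are substituted into the user's query, can be illustrated with a toy sketch. This is deliberately simplified and is not the MiniCon algorithm itself; all source and relation names are hypothetical.

```python
# Toy sketch of view-based query reformulation: each data source is described
# as a view over a mediated-schema relation, and a query over that relation is
# rewritten as a union over the sources that cover it. (Real reformulation,
# e.g. MiniCon, handles joins, variable mappings, and partial coverage.)

# Hypothetical source descriptions: source name -> mediated relation provided.
SOURCE_VIEWS = {
    "amazon_wrapper":  "Book(title, author, price)",
    "library_wrapper": "Book(title, author, price)",
    "reviews_wrapper": "Review(title, rating)",
}

def reformulate(query_relation):
    """Rewrite a mediated-schema relation into a union of matching sources."""
    sources = [s for s, view in SOURCE_VIEWS.items()
               if view.split("(")[0] == query_relation]
    if not sources:
        raise ValueError(f"no source covers {query_relation}")
    return " UNION ".join(f"scan({s})" for s in sources)

print(reformulate("Book"))
# → scan(amazon_wrapper) UNION scan(library_wrapper)
```

The union over overlapping sources produced here is exactly the kind of subplan the dynamic collector operator (described below) is designed to execute robustly.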
"The Tukwila data integration system introduces a number of new techniques for query reformulation, optimization, and execution. Query processing in data integration occurs over network-bound, autonomous data sources ranging from conventional databases on the LAN or intranet to web-based sources across the Internet. [High-volume data] requires extensions to traditional optimization and execution techniques for three reasons: there is an absence of quality statistics about the data; data transfer rates are unpredictable and bursty; and slow or unavailable data sources can often be replaced by overlapping or mirrored sources. Additional challenges are posed when we wish to integrate XML data...

During execution, Tukwila uses adaptive query operators such as the double pipelined hash join, which produces answers quickly, and the dynamic collector, which robustly and efficiently computes unions across overlapping data sources...

The Tukwila query processing components are designed to be self-contained modules that can be swapped out as needed. Each of the main components (reformulator, optimizer, execution engine, and wrappers) is a separate code module, optionally in a different language and on a different platform. A sockets-based communication interface with a standardized request model allows us to interchange parts."
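The double pipelined hash join mentioned above can be sketched as a symmetric hash join: both inputs are hashed as tuples arrive, and each arriving tuple immediately probes the other side's table, so results flow before either input is exhausted. In this minimal sketch the strict alternation between inputs stands in for Tukwila's actual scheduling, and the relations and attribute names are illustrative.

```python
# Sketch of a double pipelined (symmetric) hash join: one hash table per
# input; every arriving tuple is inserted into its own table and then probes
# the opposite table, so matches are emitted incrementally with no blocking
# build phase.
from collections import defaultdict
from itertools import zip_longest

def double_pipelined_join(left, right, key):
    """Alternate between two tuple streams, yielding joined pairs as found."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for l, r in zip_longest(left, right):
        if l is not None:
            left_table[l[key]].append(l)        # insert into own table...
            for match in right_table[l[key]]:   # ...then probe the other side
                yield (l, match)
        if r is not None:
            right_table[r[key]].append(r)
            for match in left_table[r[key]]:
                yield (match, r)

# Hypothetical inputs standing in for two network-bound sources.
employees = [{"dept": "db", "name": "Ann"}, {"dept": "os", "name": "Bo"}]
depts     = [{"dept": "db", "floor": 3},   {"dept": "os", "floor": 5}]
for pair in double_pipelined_join(employees, depts, "dept"):
    print(pair)
```

Because each pair is yielded by whichever tuple arrives second, no pair is produced twice, and the first answers appear as soon as both sides have delivered a matching tuple; this is what lets the operator tolerate slow, bursty sources.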