Cover Pages: Xyleme Project: Dynamic Data Warehouse for the XML Data of the Web

[January 26, 2001] The Xyleme Project "is functioning as an open, loosely coupled network of researchers. The Verso Group at INRIA was at the origin of the project together with F. Bancilhon, formerly CEO of O2 Technology. The database groups from Mannheim U. and the CNAM-Paris as well as the IASI Team of University of Paris-Orsay rapidly joined, as did a number of individual researchers from a number of places. The goal is to design and prototype a data warehouse able to exploit all XML data that can be found on the Web. The emphasis is on high level services that are difficult or impossible to support with the current Web technologies. In particular, we consider more complex query processing than the simple keyword search of actual search engines, semantic data integration and sophisticated monitoring of changes. Xyleme will warehouse all the XML data of the Web, host application data (e.g., answers to queries) and, in a second stage, will host the applications themselves. The project research directions are as follows: (1) storage: we need to efficiently store and retrieve huge quantities of XML data (hundreds of millions of pages) with a granularity that goes beyond the document. For this, we use the Natix store that is being developed at U. of Mannheim and is tailored to manage tree data. (2) query processing: we plan to use the official query language for XML when available. A lot of our effort with respect to query processing is in developing the appropriate indexing mechanism. A major issue is the size of indexes. (3) data acquisition: data will typically be obtained by pull (e.g., discovered by crawling the Web) and push (e.g., publication by Web servers). We are working on techniques to keep the Xyleme repository as up to date with the Web as possible. For that, the strategy to refresh pages is essential. (4) change control: data on the Web is rapidly changing and users are often interested in such changes. We are working on services such as change monitoring and query subscription. (5) semantic data integration: we want to free users from having to deal with specific DTDs when expressing queries, when typically there will be many DTDs available for the specific domain of interest. In particular, we are investigating automatic data integration based on thesauri of terms and clustering techniques of DTDs. (6) architecture: Only parallelism can handle such a volume of data (terabytes) and workload. We are distributing the data on a local network of Linux PCs, which raises a number of system issues. (7) data cleaning and duplicate elimination are important issues that we do not address for the moment."

Project rationale: "XML is a standard to exchange structured data over the Web. It is being widely adopted. It is believed that progressively more and more Web data will be in XML form and that DTDs will be available to describe it (BizTalk, OASIS). Communities (scientific, business, others) will define their own DTDs to provide for a uniform representation of data in specific areas. Many already have, in areas as diverse as real estate or chemistry. Although a large portion of the Web will remain unstructured or hidden behind interactive graphics and forms, a very large and content-wise essential portion of the Web will soon be open and available in XML. Given this, we propose to study and build a dynamic World Wide XML warehouse. Indeed, we plan to design a data warehouse capable of storing all the XML data available on the planet. XML is still in its infancy, so this is not a very big challenge yet, but we believe that it will soon become one. So, a major issue is scalability. The problems we address are typical warehousing problems such as change control or data integration."
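Research direction (5) above, thesaurus-based semantic integration, can be illustrated with a minimal sketch. It is not Xyleme's implementation: the thesaurus entries, the tag names, the two sample documents and the helper names `abstract_term` and `find_by_term` are all assumptions made up for illustration. The idea shown is only that a user asks for an abstract term, and a thesaurus resolves it to whatever tag each DTD happens to use.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesaurus: DTD-specific tag names mapped to one abstract term.
THESAURUS = {
    "price": "price",
    "cost": "price",
    "amount": "price",
    "city": "location",
    "town": "location",
}


def abstract_term(tag):
    """Return the abstract term for a tag, or the lowercased tag if unknown."""
    return THESAURUS.get(tag.lower(), tag.lower())


def find_by_term(xml_text, term):
    """Collect the text of every element whose tag maps to `term`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if abstract_term(el.tag) == term]


# Two documents following different (made-up) DTDs for real-estate ads.
doc_a = "<ad><town>Paris</town><cost>300000</cost></ad>"
doc_b = "<listing><city>Mannheim</city><price>250000</price></listing>"

# One query on the abstract term "price" works against both documents.
for doc in (doc_a, doc_b):
    print(find_by_term(doc, "price"))
```

A hand-written dictionary stands in here for what the project describes as automatic integration based on thesauri and DTD clustering; building that mapping automatically is the research problem, and this sketch does not address it.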

References:

Xyleme Project Web site
Xyleme Project Preliminary Report
Xyleme Technical Reports
"Change-centric management of versions in an XML warehouse."
"Acquiring XML pages for a Web house."
Xyleme Overview - Slide presentation
[March 24, 2001] "Representing and Querying XML with Incomplete Information." By Serge Abiteboul (INRIA), Luc Segoufin (INRIA), and Victor Vianu (UC San Diego). Paper presented at PODS 2001. Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS). May 21 - 24, 2001. Santa Barbara, California, USA. With 25 references. Abstract: "We study the representation and querying of XML with incomplete information. We consider a simple model for XML data and their DTDs, a very simple query language, and a representation system for incomplete information in the spirit of the representation systems developed by Imielinski and Lipski for relational databases. In the scenario we consider, the incomplete information about an XML document is continuously enriched by successive queries to the document. We show that our representation system can represent partial information about the source document acquired by successive queries, and that it can be used to intelligently answer new queries. We also consider the impact on complexity of enriching our representation system or query language with additional features. The results suggest that our approach achieves a practically appealing balance between expressiveness and tractability. The research presented here was motivated by the Xyleme project at INRIA, whose objective it is to develop a data warehouse for Web XML documents... The main contribution of this paper is a simple framework for acquiring, maintaining, and querying XML documents with incomplete information. The framework provides a model for XML documents and DTDs, a simple XML query language, and a representation system for XML with incomplete information. We show that the incomplete information acquired by consecutive queries and answers can be efficiently represented and incrementally refined using our representation system. Queries are handled efficiently and flexibly. They are answered as best possible using the available information, either completely, or by providing an incomplete answer using our representation system. Alternatively, full answers can be provided by completing the partial information using additional queries to the sources, guaranteed to be non-redundant. Our framework is limited in many ways. For example, we assume that sources provide persistent node ids. Order in documents and DTDs is ignored, and is not used by queries. The query language is very simple, and does not use recursive path expressions and data joins. In order to trace the boundary of tractability, we considered several extensions to our framework and showed that they have significant impact on handling incomplete information, ranging from cosmetic to high complexity or undecidability. This justifies the particular cocktail of features making up our framework, and suggests that it provides a practically appealing solution to handling incomplete information in XML."
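The scenario in the abstract above (knowledge about a document grows with each query answer, and a new query is answered either completely or with an answer flagged as incomplete) can be pictured with a toy sketch. This is not the paper's representation system, which is considerably richer; the class `PartialDocument`, its methods and the node ids are hypothetical names chosen for illustration. The sketch relies only on the assumption stated in the abstract that sources provide persistent node ids.

```python
class PartialDocument:
    """Toy store of what earlier query answers revealed about one document."""

    def __init__(self):
        self.label = {}        # node id -> element label
        self.children = {}     # node id -> set of known child ids
        self.complete = set()  # ids of nodes whose full child list is known

    def record(self, parent, kids, all_children):
        """Merge a query answer: `kids` is a list of (id, label) under `parent`.

        `all_children` says whether the answer listed every child of `parent`.
        """
        known = self.children.setdefault(parent, set())
        for node_id, label in kids:
            self.label[node_id] = label
            known.add(node_id)
        if all_children:
            self.complete.add(parent)

    def children_with_label(self, parent, label):
        """Return (matching child ids, whether the answer is complete)."""
        hits = sorted(
            n for n in self.children.get(parent, ()) if self.label[n] == label
        )
        return hits, parent in self.complete


view = PartialDocument()

# A first query returned only the title child of node n1.
view.record("n1", [("n2", "title")], all_children=False)
print(view.children_with_label("n1", "author"))  # incomplete: authors may exist

# A later query returned every child of n1, so answers about n1 become complete.
view.record("n1", [("n2", "title"), ("n3", "author")], all_children=True)
print(view.children_with_label("n1", "author"))
```

When the second element of the result is false, a client would know that an additional query to the source is needed for a full answer, which mirrors the "complete or incomplete answer" distinction in the abstract at a much smaller scale.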
[March 24, 2001] "Xyleme, une start-up de l'Inria pour structurer le Web en XML" (Xyleme, an INRIA start-up to structure the Web in XML). From 01net.com. March 01, 2001. "Xyleme wants to structure the semantic data of the Web in XML. The goal? To build a professional search engine that can be queried from the enterprise's information system." ["The Web is moving from HTML to XML, with all the major players, Microsoft, IBM, Oracle, content providers, B2B enablers, behind this revolution. Xyleme exploits this revolution to create a new service through an indexed XML repository that stores Web knowledge and that is capable of answering queries from applications and users. The outcome is a seamless integration between the web and corporate information systems... Xyleme is designed to store, classify, index and monitor XML data on the Web. The emphasis is on high level services that are difficult or impossible to support with the current Web technologies. In particular, we consider more complex query processing than the simple keyword search of actual search engines, semantic data integration and sophisticated monitoring of changes..."]

