A dynamic warehouse for the XML data of the Web
(preliminary report)



Abstract: Xyleme is a dynamic warehouse for the XML data of the Web supporting change control and data integration. We briefly present our motivations, the general architecture, the main aspects of Xyleme and the status of the work.

Xyleme: a complex tissue of wood cells, functions in conduction and storage ...

1  Introduction and Motivation

The current development of the Web and the generalization of XML technology provide a major opportunity that could radically change the face of the Web. We are working on a dynamic XML warehouse for the Web, namely Xyleme. In the present paper, we briefly present our motivations, Xyleme's general architecture and main aspects, as well as the status of the work.

The Web is huge and keeps growing at a healthy pace. Most of the data is unstructured, consisting of text (essentially HTML) and images. Some is structured, mostly stored in relational databases. All this data constitutes the largest body of information accessible to any individual in the history of humanity. The lack of structure and the distribution of data seriously limit access to the information. To provide functionalities beyond full-text searching à la AltaVista or Yahoo!, specific applications are being built that require heavy investments. A major evolution is occurring that will dramatically simplify the task of developing such applications.

XML is coming!
XML [xml-w3c] is a standard for exchanging structured data over the Web. It is being widely adopted. It is believed that progressively more and more Web data will be in XML form and that DTDs will be available to describe it (see [biztalk, oasis]). Communities (scientific, business, and others) will define their own DTDs to provide a uniform representation of data in specific areas. Many have already done so, in areas as diverse as real estate and chemistry. Although a large portion of the Web will remain unstructured or hidden behind interactive graphics and forms, a very large and, content-wise, essential portion of the Web will soon be open and available in XML.

Given this, we propose to study and build a dynamic World Wide XML warehouse, i.e., a data warehouse capable of storing all the XML data available on the planet. XML is still in its infancy, so this is not a very big challenge yet1, but we believe that it will soon become one. A major issue is therefore scalability. The problems we address are typical warehousing problems such as change control and data integration; in the Web context, they acquire a truly distinct flavor. More precisely, the research directions are as follows:

Clearly, these goals are very challenging technically. However, one may question the usefulness of the approach. Certain functionalities of Xyleme are standard and are, or will soon be, found in many systems. For instance, from a query viewpoint, search engines will certainly take advantage of XML structure to provide (like Xyleme) attribute-equals-value search, and will support query languages for XML when they become available. As another example, from a monitoring viewpoint, some systems such as MindIt already offer some form of monitoring of the Web. What more can be brought by an all-Web warehousing approach?

This is best illustrated by a couple of examples. First, consider query processing. Certain queries requiring joins are simply not conceivable in a Web context for performance reasons. They become feasible with a warehousing approach. Another possible model would require that each application warehouse the particular data it needs. We believe that a worldwide warehouse such as Xyleme is often a cost-effective alternative. Consider now versioning. A user interested in a specific catalog may load the catalog at time t. Coming back one week later, the user may want to refresh this version of the catalog and also know which portions changed. Xyleme will support such services, which go beyond simple monitoring of document changes and require storing versions of data and query answers. More examples of new services supported by Xyleme will be encountered in the paper.

So, Xyleme will warehouse all the XML data of the Web, host application data (e.g., answers to queries) and, in a second stage, will host the applications themselves.

The Xyleme project functions as an open, loosely coupled network of researchers. The Verso Group at INRIA was at the origin of the project, together with F. Bancilhon, formerly CEO of O2 Technology. The database groups from Mannheim U. and CNAM-Paris, as well as the IASI team of the University of Paris-Orsay, rapidly joined, as did individual researchers from a number of other places.

The paper is organized as follows. Section X introduces the architecture. We discuss the repository in Section X. Section X deals with query processing and indexing. Data acquisition is considered in Section X and change control in Section X. The topic of Section X is data integration. Section X is a conclusion.

2  Architecture

In this section, we discuss the architecture of the system.

The system is functionally organized as follows:

Figure 1: Architecture (a) (b)
The functional architecture is pictured in Figure X. All modules are written in C++, except for the client, which is mostly intended to run in standard Web browsers as Java applets. The hardware architecture consists of PCs running Linux. As shown in Figure X, we (logically) distinguish between several kinds of machines:

3  Repository

In this section, we discuss the repository.

Xyleme requires an efficient, updatable store for XML data. Natix, developed at the University of Mannheim, is an excellent match for this requirement.

Typically, two storage approaches have been used for such data: (i) store XML pages as byte streams, or (ii) store the data/DOM tree in a conventional DBMS. The first has the disadvantage that it privileges a certain granularity of data access and requires parsing to obtain the structure. The second artificially creates lots of objects for even medium-size documents, and this, together with the additional layers of a DBMS, tends to slow down processing. Natix uses a hybrid approach: data are stored as trees down to a certain depth, below which byte streams are used. Furthermore, some minimal structure is kept in the ``flat'' part to facilitate access with flexible levels of granularity. Also, Natix offers very flexible means of choosing the border between the object and flat worlds.

Natix is built, in a standard manner, on top of a record manager with fixed-size pages and segments (sequences of pages). Standard functionalities for record management (e.g., buffering) and fast access (e.g., indexes) are provided by Natix. The performance of Xyleme will heavily depend on the efficiency of the repository. The use of a relational system would result in very bad clustering and some waste of space for every single element. An object system would tackle clustering better but would also be very space consuming. We believe Natix will provide better performance because it takes advantage of the particularities of the data. The first measurements available seem to support this belief, but more extensive tests are clearly needed.

Perhaps the most interesting aspect of the system is the splitting of XML trees into pages. An example is shown in Figure X. Observe the use of ``proxy'' nodes (p-nodes) and ``aggregate'' nodes (h-nodes). One can control the behavior of the split by specifying, at the DTD level, which splits are forbidden, which are forced, and which are left to the responsibility of the system.

Figure 2: Distributing nodes on records
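
The splitting idea can be sketched in a few lines. The following is a toy illustration of our reading of the scheme, not Natix's actual code: the node and record structures, the capacity limit, and the proxy naming are all invented for the example.

```python
# Toy sketch of Natix-style tree partitioning: subtrees are packed into
# fixed-capacity records; an oversized subtree is stored in its own
# record and replaced in its parent by a "proxy" node referencing it.
# (Node, CAPACITY, and the proxy tag convention are illustrative only.)

CAPACITY = 4  # max nodes per record (tiny, for illustration)

class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)

def split(node, records):
    """Return a (possibly proxied) copy of `node`; every subtree moved
    to its own record is appended to `records`."""
    kept = Node(node.tag)
    for child in node.children:
        if size(child) > CAPACITY:
            # Subtree too large: give it its own record, keep a proxy.
            sub = split(child, records)
            records.append(sub)
            kept.children.append(Node("proxy:" + child.tag))
        else:
            kept.children.append(child)
    return kept

doc = Node("catalog", [
    Node("product", [Node("name"), Node("price"), Node("desc"), Node("img")]),
    Node("product", [Node("name"), Node("price")]),
])
records = []
root = split(doc, records)
print([c.tag for c in root.children])  # the large product is proxied out
```

In a real system, the split decision would also honor the per-DTD directives mentioned above (forbidden or forced splits), rather than relying on size alone.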

4  Query Processing

Queries live at many levels in the system: queries to the ``abstract DTDs'' (see Section X), change queries (see Section X), or actual Xyleme queries issued directly. Ultimately, all result in a complex query being managed by the system. This is the topic of the present section.

Query processing is certainly not a new topic. As a matter of fact, the Xyleme query processor relies mainly on relational and object-oriented techniques. Still, the particularities of XML and the huge size of the warehouse we are considering raise a number of interesting problems.

5  Data Acquisition

In this section, we consider the issue of data acquisition. In a first stage, we crawl the Web in search of XML data. This is a standard task. We implemented our own crawler, which runs at the respectable speed of 100,000 pages per hour. Note that most pages, e.g., GIF images or even HTML pages, are not stored or indexed by Xyleme. Like every crawler we know of, ours is parameterized by some depth k: when discovering a new HTML page in a site, the crawler does not look deeper than k links without exiting that site. This tends to privilege breadth over depth to some extent. To crawl fast, we need to decide very quickly, given a URL, whether it is new to Xyleme or not. For that, we use URL ``signatures'' and a large hash table, at the cost of (statistically rarely) concluding that we already know a page when we do not.
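
The signature idea can be illustrated as follows. This is our reading of the trade-off, not Xyleme's actual code: the hash function, signature width, and class names are all assumptions made for the example.

```python
# Sketch of the URL-signature trade-off: instead of storing full URLs,
# store a short hash "signature" per URL in a large table. A signature
# collision makes a genuinely new URL look already known -- the
# statistically rare false positive the text accepts in exchange for
# speed and compactness.
import hashlib

SIG_BYTES = 4  # 32-bit signatures; a real system would likely use more

def signature(url: str) -> int:
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:SIG_BYTES], "big")

class SeenURLs:
    def __init__(self):
        self.table = set()  # stands in for the large hash table

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL is (probably) already known."""
        sig = signature(url)
        if sig in self.table:
            return True          # may be a false positive on collision
        self.table.add(sig)
        return False

seen = SeenURLs()
print(seen.check_and_add("http://example.com/a.xml"))  # False: new URL
print(seen.check_and_add("http://example.com/a.xml"))  # True: known
```

Note the asymmetry: a false positive only means a page is skipped until a later crawl, whereas storing full URLs for hundreds of millions of pages would dominate memory.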

A critical issue is the strategy to decide when to refresh a document. Xyleme administrators should be able to specify some general acquisition policies, e.g., whether it is more urgent to explore new branches of the Web or to refresh the documents we already have. The decision to refresh certain documents is based on criteria such as:

Periodically (e.g., a couple of times a day), Xyleme determines the next pages to refresh. The strategy is a dynamic optimization: we try to minimize, for each time period (say, half a day), a particular cost function under a constraint. The cost is defined as the dissatisfaction of users being presented stale data; the constraint is the average number of pages that Xyleme can read per time period. The cost function is designed so that important and rapidly changing pages are read more often. Also, eventually all pages get read (no eternal starvation), unless the administrator explicitly decides to leave a portion of the Web out, e.g., pages below a certain importance threshold.
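
A minimal version of such a scheduler might look as follows. The cost model (importance times change rate times staleness) is our simplification for illustration; Xyleme's actual cost function is not given here.

```python
# Illustrative refresh scheduler: per period, read the `budget` pages
# whose estimated "user dissatisfaction" from staleness is highest.
# Note that dissatisfaction grows with age, so even unimportant pages
# are eventually selected -- the no-starvation property mentioned above.

def choose_refresh(pages, budget):
    """pages: list of dicts with importance, change_rate (changes/day)
    and age (days since last read). Returns URLs to read this period."""
    def dissatisfaction(p):
        # Expected number of unseen changes, weighted by importance.
        return p["importance"] * p["change_rate"] * p["age"]
    ranked = sorted(pages, key=dissatisfaction, reverse=True)
    return [p["url"] for p in ranked[:budget]]

pages = [
    {"url": "news.xml",    "importance": 0.9, "change_rate": 24.0, "age": 0.5},
    {"url": "catalog.xml", "importance": 0.5, "change_rate": 1.0,  "age": 3.0},
    {"url": "archive.xml", "importance": 0.1, "change_rate": 0.01, "age": 30.0},
]
print(choose_refresh(pages, budget=2))  # ['news.xml', 'catalog.xml']
```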

From this, it should be clear that the quality of page acquisition depends on the quality of the guiding parameters. It is reasonably easy to estimate the change rate of a page based on how many times a change was detected in the last (say, 100) attempts to refresh it. On the other hand, it is more difficult to estimate the importance of a page. The number of accesses is difficult to evaluate, and may be meaningless if the page is often accessed uselessly. We prefer to use a page ranking in the style of Google [ibmrank, google]. The intuition is that an important page carries its importance to the pages it points to. This leads to computing the fixpoint of a function over the set of pages.

We run our page-ranking program on the graph of XML and HTML pages. This in itself is not a simple task, because of the hundreds of millions of HTML pages. But clearly, the importance of XML pages cannot be regarded in isolation. The program provides the importance of all pages of the Web and can be used (i) to rank XML pages in query results and (ii) to guide the refreshing of XML pages. Observe also that to perform this page ranking, we need the link structure2 of the Web. The maintenance of this data structure, while pages are constantly loaded, is also a very intensive task.
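
The fixpoint computation can be sketched in the style of PageRank [google]: a page spreads its importance over the pages it points to, and the importance vector is iterated to convergence. The damping factor and the tiny graph below are illustrative choices, not parameters taken from Xyleme.

```python
# Power-iteration sketch of the page-importance fixpoint.

def page_rank(links, damping=0.85, iters=50):
    """links: dict page -> list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Each page keeps a base share and receives importance from
        # every page that links to it.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                continue
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = page_rank(links)
print(max(rank, key=rank.get))  # "c": it collects importance from a and b
```

Running this at Web scale is a different matter: the rank vector and link structure no longer fit in memory, which is exactly the maintenance burden noted above.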

Note that we also have to refresh HTML pages, since the new version of an ``old'' HTML page may lead to the discovery of new XML pages. To decide when to refresh such HTML pages, the above ranking is not useful: for instance, a very important HTML page is useless for us if it does not lead to XML pages. We therefore use another definition of importance for HTML pages (importance for Xyleme), which leads to another fixpoint evaluation. This entire process is rather expensive, but it is only run periodically, and measurements on our first prototype show that it is feasible even for hundreds of millions of HTML pages.

6  Change Control

Users are often not only interested in the current values of documents but also in their evolution. They want to see changes as information that can be consulted or queried. Indeed, change detection and notification is becoming an important part of Web services, e.g., [mindit, doubletwist, ice].

Since Xyleme is a Web warehouse, it is impossible to have a real-time vision of the data. For instance, the time of a document version is the time when Xyleme decides to read or refresh the document. For each page, Xyleme thus gets snapshots. Given two consecutive versions of page D, at times ti-1 and ti, we can compute the modifications Di-1,i that occurred in between using a diff algorithm; we use the diff of Shasha and Wang [shasha-algo]. Such deltas can be used to monitor changes. They can also serve as the basis for a query-subscription and versioning mechanism. Observe, however, that several of the functionalities we will consider may be very space consuming; in particular, we will not be able to version all the XML pages of the Web.

We discuss next a number of services that Xyleme is addressing:

To represent versions, we opted for a change-centric approach, as opposed to the data-centric one that is more natural for database versioning. We use deltas in the style of [hull]. Deltas are often lossy and cannot be inverted. As is often done in physical logs, we use completed deltas, which contain additional information so that they can be reversed. In our version representation, all XML nodes are assigned a persistent identifier, which we call an XID (for Xyleme ID). The use of persistent identifiers is essential to describe changes and to query changes efficiently. For each versioned document, we store the current version, an XID-map (an original way to represent the IDs of this version) and the ``forward unit completed deltas'' (an XML document). An example is shown in Figure X. In the figure, the XIDs are shown in the tree, although in reality they are not stored inside the document but only in the XID-map.

An essential aspect of our management of versions is that, when a new version is installed, the new unit delta is appended to the existing delta for that page. This can be done without having to update data in place in the store. The price for it is that we store redundant information. A delta compression technique is used periodically to recover space.
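
A toy model conveys the mechanics of completed deltas. Here a document is flattened to a map from persistent XIDs to values, and a delta records both old and new values so it can be applied forward or undone; this flat model and its operation format are our simplifications, not Xyleme's storage format.

```python
# Minimal sketch of append-only versioning with completed deltas.

def diff(old, new):
    """Completed delta: list of (xid, old_value, new_value) changes.
    Storing old_value alongside new_value is what makes it reversible."""
    xids = set(old) | set(new)
    return [(x, old.get(x), new.get(x))
            for x in sorted(xids) if old.get(x) != new.get(x)]

class VersionedDoc:
    def __init__(self, doc):
        self.current = dict(doc)
        self.deltas = []          # append-only, never updated in place

    def install(self, new_doc):
        self.deltas.append(diff(self.current, new_doc))
        self.current = dict(new_doc)

    def version(self, i):
        """Reconstruct version i (0 = initial) by undoing later deltas."""
        doc = dict(self.current)
        for delta in reversed(self.deltas[i:]):
            for xid, old_val, _ in delta:
                if old_val is None:
                    doc.pop(xid, None)   # element did not exist yet
                else:
                    doc[xid] = old_val
        return doc

v = VersionedDoc({1: "<price>10</price>"})
v.install({1: "<price>12</price>", 2: "<promo/>"})
print(v.version(0))  # {1: '<price>10</price>'}
```

Installing a version only appends to the delta list, never rewriting stored data, which mirrors the append-without-update-in-place property above; the growing redundancy is what the periodic delta compression recovers.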

Figure 3: Storage of a versioned document

7  Semantic Data Integration

In this section, we address the issue of providing some form of semantic data integration in Xyleme. More precisely, for a given application domain, we want to offer users the means to access data without having to learn the exact DTD, or multiple DTDs, that structure data in this particular domain. In particular, a user may want to ask queries that extract and combine data from XML documents with different DTDs, without the possibly heavy task of expressing such queries over several DTDs directly. We would like queries to be expressed at an abstract, semantic level, and the system to automatically reformulate them into actual queries in the Xyleme query language.

Achieving this goal requires (i) some extra knowledge about the semantics of the particular DTDs, (ii) some processing to analyze that knowledge, and (iii) the compilation of such abstract queries into actual Xyleme queries.
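
Step (iii) can be pictured with a hypothetical example: an abstract path is rewritten into one concrete path per DTD via a mapping table. The abstract and concrete paths, the DTD names, and the mapping itself are all invented for the illustration; the actual Xyleme query language is not shown.

```python
# Hypothetical abstract-to-concrete query rewriting via a mapping table.
# Each known DTD maps abstract paths to its own concrete element paths.

MAPPINGS = {
    "museum_dtd_1": {"Artwork/Title": "oeuvre/titre"},
    "museum_dtd_2": {"Artwork/Title": "piece/name"},
}

def rewrite(abstract_path):
    """Return, per DTD, the concrete path answering the abstract one."""
    return {dtd: m[abstract_path]
            for dtd, m in MAPPINGS.items() if abstract_path in m}

print(rewrite("Artwork/Title"))
# one concrete query per DTD, all answering the same abstract question
```

The point of the indirection is that the user writes one query against the abstract DTD, while the system fans it out to every concrete DTD in the cluster and merges the answers.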

An ideal solution for us would be for this extra knowledge to be directly provided by the designers of the particular DTDs, or by some domain expert, as meta-data in languages such as RDF or XML-Schema. Such approaches will be essential in the long run, and we are studying their impact. For now, however, we should not expect to find the knowledge we need in this manner, since the standards are not yet agreed upon. We therefore consider how to obtain such knowledge automatically, or how it can be provided by an expert.

Such knowledge can be automatically acquired using available ontologies or thesauri. Several general and specialized ontologies exist. WordNet, for instance, is a general linguistic ontology providing a hierarchical organization of synonym sets, each representing one underlying lexical concept. The hierarchical categories of topics provided by the search engine Yahoo! can also be seen as such an ontology. The problem is then to plug in or match existing ontologies or thesauri with the DTDs known to Xyleme. The problem varies depending on whether the Xyleme data we consider are pulled or pushed. We can expect people who push data to the Xyleme server to characterize their data relative to the list of given topics that constitute the Xyleme ontology. For pulled data, the problem is different, because we have to find a way to connect the structure of the DTD with our given set of topics.

Our first work in this direction is the design and implementation of scalable machine-learning techniques to automatically classify DTDs into clusters, according to similarities between subsets of DTDs and/or XML documents. The expected volume (thousands of DTDs, and soon more) makes this quite challenging. Once a cluster of actual DTDs relating to a single domain has been isolated, the next task is to automatically generate an abstract DTD covering both the common aspects of these DTDs and their particularities. Such processing is done off-line and leads to the incremental building or enrichment of a set of forms constituting semantic views over the Xyleme repository. The connection between this semantic layer and the data stored in Xyleme relies on the automatic translation from the language of forms to the Xyleme query language.
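
A deliberately small stand-in for the classification step: DTDs are represented by their sets of element names and grouped greedily by Jaccard similarity. The similarity measure, threshold, and greedy single-pass strategy are our choices for illustration; the scalable techniques mentioned above are not reproduced here.

```python
# Toy DTD clustering by element-name overlap.

def jaccard(a, b):
    """Similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster(dtds, threshold=0.3):
    """dtds: dict DTD name -> set of element names.
    Greedy single pass: join the first cluster similar enough,
    otherwise start a new one."""
    clusters = []  # list of (member names, union-of-tags) pairs
    for name, tags in dtds.items():
        for members, rep in clusters:
            if jaccard(tags, rep) >= threshold:
                members.append(name)
                rep |= tags        # grow the cluster representative
                break
        else:
            clusters.append(([name], set(tags)))
    return [members for members, _ in clusters]

dtds = {
    "estate1": {"house", "price", "surface"},
    "estate2": {"house", "price", "rooms"},
    "chem":    {"molecule", "atom", "bond"},
}
print(cluster(dtds))  # [['estate1', 'estate2'], ['chem']]
```

In this sketch, each cluster's union of element names is a crude stand-in for the abstract DTD that would be generated from the cluster.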

We are currently investigating and implementing these two major tasks (semantic classification and integration). We are also experimenting with more ``manual'' approaches: humans are specifying the mappings between concrete and abstract DTDs in two application domains, culture (museums) and travel. Clearly, the automatic and manual approaches could be used in alternation to build rich semantic domains.

In this setting, the technology we use consists of knowledge representation techniques that offer declarative formalisms for expressing semantic and flexible views over data, together with (i) inference algorithms that provide services for automatic classification and (ii) query rewriting using views.

8  Conclusion

As mentioned earlier, Xyleme is a project distributed among several groups, mostly in France but also in Germany. The design of Version 1 of the system (planned for the end of 2000) is almost complete. The repository, the crawler and the page ranker are running; the rest, code-wise, varies from mostly done to still in limbo. We expect to discover new problems along the way and believe that the experience will remain as challenging and exciting as it has been so far.

The number of accessible XML pages is marginal for the moment, less than 1% of the number of HTML pages. However, we believe that many companies have already switched to XML and that the number of XML pages will explode as soon as XML is supported by most installed browsers (within a year?).
For instance, with 200 million pages, 1.5 billion links were found in [graphstructure]. This gives some idea of the size of the data structure that has to be maintained and the difficulty of the task.
