A dynamic warehouse for the XML data of the Web
(preliminary report)



Abstract: Xyleme is a dynamic warehouse for the XML data of the Web supporting change control and data integration. We briefly present our motivations, the general architecture, the main aspects of Xyleme and the status of the work.

Xyleme: a complex tissue of wood cells, functions in conduction and storage ...

1  Introduction and Motivation

The current development of the Web and the generalization of XML technology provide a major opportunity that could radically change the face of the Web. We are working on a dynamic XML warehouse for the Web, namely Xyleme. In the present paper, we briefly present our motivations, Xyleme's general architecture and main aspects, as well as the status of the work.

The Web is huge and keeps growing at a healthy pace. Most of the data is unstructured, consisting of text (essentially HTML) and images. Some is structured, mostly stored in relational databases. All this data constitutes the largest body of information accessible to any individual in the history of humanity. The lack of structure and the distribution of data seriously limit access to the information. To provide functionalities beyond full-text searching à la AltaVista or Yahoo!, specific applications are being built that require heavy investments. A major evolution is occurring that will dramatically simplify the task of developing such applications.

XML is coming!
XML [xml-w3c] is a standard for exchanging structured data over the Web. It is being widely adopted. It is believed that progressively more and more Web data will be in XML form and that DTDs will be available to describe it (see [biztalk, oasis]). Communities (scientific, business, and others) will define their own DTDs to provide a uniform representation of data in specific areas. Many have already done so, in areas as diverse as real estate and chemistry. Although a large portion of the Web will remain unstructured or hidden behind interactive graphics and forms, a very large and, content-wise, essential portion of the Web will soon be open and available in XML.

Given this, we propose to study and build a dynamic World Wide XML warehouse, i.e., a data warehouse capable of storing all the XML data available on the planet. XML is still in its infancy, so this is not a very big challenge yet1, but we believe that it will soon become one. A major issue is therefore scalability. The problems we address are typical warehousing problems such as change control and data integration; in the Web context, they acquire a truly distinct flavor. More precisely, the research directions are as follows:

Clearly, these goals are very challenging technically. However, one may question the usefulness of the approach. Certain functionalities of Xyleme are standard and are, or will soon be, found in many systems. For instance, from a query viewpoint, search engines will certainly take advantage of XML structure to provide (like Xyleme) attribute-equals-value search, and will support query languages for XML when they become available. As another example, from a monitoring viewpoint, some systems such as MindIt already offer some form of monitoring of the Web. What more can be brought by an all-Web warehousing approach?

This is best illustrated by a couple of examples. First, consider query processing. Certain queries requiring joins are simply not conceivable in a Web context for performance reasons. They become feasible with a warehousing approach. Another possible model would require that each application warehouse the particular data it needs. We believe that a worldwide warehouse such as Xyleme is often a cost-effective alternative. Consider now versioning. A user interested in a specific catalog may load the catalog at time t. Coming back one week later, the user may want to refresh this version of the catalog and also know which portions changed. Xyleme will support such services, which go beyond simple monitoring of document changes and require storing versions of data and query answers. More examples of new services supported by Xyleme will be encountered in the paper.

So, Xyleme will warehouse all the XML data of the Web, host application data (e.g., answers to queries) and, in a second stage, will host the applications themselves.

The Xyleme project functions as an open, loosely coupled network of researchers. The Verso Group at INRIA was at the origin of the project, together with F. Bancilhon, formerly CEO of O2 Technology. The database groups from Mannheim U. and CNAM-Paris, as well as the IASI team of the University of Paris-Orsay, rapidly joined, as did individual researchers from a number of other places.

The paper is organized as follows. Section X introduces the architecture. We discuss the repository in Section X. Section X deals with query processing and indexing. Data acquisition is considered in Section X and change control in Section X. The topic of Section X is data integration. Section X is a conclusion.

2  Architecture

In this section, we discuss the architecture of the system.

The system is functionally organized as follows:

Figure 1: Architecture (a) (b)
The functional architecture is pictured in Figure X. All modules are written in C++, except for the client, which is mostly intended to run in standard Web browsers as Java applets. The hardware architecture consists of PCs running Linux. As shown in Figure X, we (logically) distinguish between several kinds of machines:

3  Repository

In this section, we discuss the repository.

Xyleme requires an efficient, updatable store for XML data. Natix, developed at the University of Mannheim, is an excellent match for this requirement.

Typically, two storage approaches have been used for such data: (i) store XML pages as byte streams, or (ii) store the data/DOM tree in a conventional DBMS. The first has the disadvantage that it privileges a certain granularity of data access and requires parsing to obtain the structure. The second artificially creates lots of objects for even medium-size documents, and this, together with the additional layers of a DBMS, tends to slow down processing. Natix uses a hybrid approach: data are stored as trees down to a certain depth, below which byte streams are used. Furthermore, some minimal structure is kept in the ``flat'' part to facilitate access with flexible levels of granularity. Also, Natix offers very flexible means of choosing the border between the object and flat worlds.

Natix is built, in a standard manner, on top of a record manager with fixed-size pages and segments (sequences of pages). Standard functionalities for record management (e.g., buffering) and fast access (e.g., indexes) are provided by Natix. The performance of Xyleme will heavily depend on the efficiency of the repository. The use of a relational system would result in very bad clustering and some waste of space for every single element. An object system would tackle clustering better but would also be very space consuming. We believe Natix will provide better performance because it takes advantage of the particularities of the data. The first measurements available seem to support this belief, but more extensive tests are clearly needed.

Perhaps the most interesting aspect of the system is the splitting of XML trees into pages. An example is shown in Figure X. Observe the use of ``proxy'' nodes (p-nodes) and ``aggregate'' nodes (h-nodes). One can control the behavior of the split by specifying, at the DTD level, which splits are forbidden, which are forced, and which are left to the responsibility of the system.

Figure 2: Distributing nodes on records
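
The splitting idea can be sketched in a few lines. The following is a toy illustration of our reading of the scheme, not Natix's actual code: the node and record structures, the capacity limit, and the proxy naming are all invented for the example.

```python
# Toy sketch of Natix-style tree partitioning: subtrees are packed into
# fixed-capacity records; an oversized subtree is stored in its own
# record and replaced in its parent by a "proxy" node referencing it.
# (Node, CAPACITY, and the proxy tag convention are illustrative only.)

CAPACITY = 4  # max nodes per record (tiny, for illustration)

class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)

def split(node, records):
    """Return a (possibly proxied) copy of `node`; every subtree moved
    to its own record is appended to `records`."""
    kept = Node(node.tag)
    for child in node.children:
        if size(child) > CAPACITY:
            # Subtree too large: give it its own record, keep a proxy.
            sub = split(child, records)
            records.append(sub)
            kept.children.append(Node("proxy:" + child.tag))
        else:
            kept.children.append(child)
    return kept

doc = Node("catalog", [
    Node("product", [Node("name"), Node("price"), Node("desc"), Node("img")]),
    Node("product", [Node("name"), Node("price")]),
])
records = []
root = split(doc, records)
print([c.tag for c in root.children])  # the large product is proxied out
```

In a real system, the split decision would also honor the per-DTD directives mentioned above (forbidden or forced splits), rather than relying on size alone.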

4  Query Processing

Queries live at many levels in the system: queries to the ``abstract DTDs'' (see Section X), change queries (see Section X), or actual Xyleme queries issued directly. Ultimately, all result in a complex query being managed by the system. This is the topic of the present section.

Query processing is certainly not a new topic. As a matter of fact, the Xyleme query processor relies mainly on relational and object-oriented techniques. Still, the particularities of XML and the huge size of the warehouse we are considering raise a number of interesting problems.

5  Data Acquisition

In this section, we consider the issue of data acquisition. In a first stage, we crawl the Web in search of XML data. This is a standard task. We implemented our own crawler, which runs at the respectable speed of 100,000 pages per hour. Note that most pages, e.g., GIF images or even HTML pages, are not stored or indexed by Xyleme. Like every crawler we know of, ours is parameterized by some depth k: when discovering a new HTML page in a site, the crawler does not look deeper than k links without exiting that site. This tends to privilege breadth over depth to some extent. To crawl fast, we need to decide very quickly, given a URL, whether it is new to Xyleme or not. For that, we use URL ``signatures'' and a large hash table, at the cost of (statistically rarely) concluding that we already know a page when we do not.
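
The signature idea can be illustrated as follows. This is our reading of the trade-off, not Xyleme's actual code: the hash function, signature width, and class names are all assumptions made for the example.

```python
# Sketch of the URL-signature trade-off: instead of storing full URLs,
# store a short hash "signature" per URL in a large table. A signature
# collision makes a genuinely new URL look already known -- the
# statistically rare false positive the text accepts in exchange for
# speed and compactness.
import hashlib

SIG_BYTES = 4  # 32-bit signatures; a real system would likely use more

def signature(url: str) -> int:
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:SIG_BYTES], "big")

class SeenURLs:
    def __init__(self):
        self.table = set()  # stands in for the large hash table

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL is (probably) already known."""
        sig = signature(url)
        if sig in self.table:
            return True          # may be a false positive on collision
        self.table.add(sig)
        return False

seen = SeenURLs()
print(seen.check_and_add("http://example.com/a.xml"))  # False: new URL
print(seen.check_and_add("http://example.com/a.xml"))  # True: known
```

Note the asymmetry: a false positive only means a page is skipped until a later crawl, whereas storing full URLs for hundreds of millions of pages would dominate memory.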

A critical issue is the strategy to decide when to refresh a document. Xyleme administrators should be able to specify some general acquisition policies, e.g., whether it is more urgent to explore new branches of the Web or to refresh the documents we already have. The decision to refresh certain documents is based on criteria such as:

Periodically (e.g., a couple of times a day), Xyleme determines the next pages to refresh. The strategy is a dynamic optimization: we try to minimize, for each time period (say, half a day), a particular cost function under a constraint. The cost is defined as the dissatisfaction of users being presented stale data; the constraint is the average number of pages that Xyleme can read per time period. The cost function is designed so that important and rapidly changing pages are read more often. Also, eventually all pages get read (no eternal starvation), unless the administrator explicitly decides to leave a portion of the Web out, e.g., pages below a certain importance threshold.
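
A minimal version of such a scheduler might look as follows. The cost model (importance times change rate times staleness) is our simplification for illustration; Xyleme's actual cost function is not given here.

```python
# Illustrative refresh scheduler: per period, read the `budget` pages
# whose estimated "user dissatisfaction" from staleness is highest.
# Note that dissatisfaction grows with age, so even unimportant pages
# are eventually selected -- the no-starvation property mentioned above.

def choose_refresh(pages, budget):
    """pages: list of dicts with importance, change_rate (changes/day)
    and age (days since last read). Returns URLs to read this period."""
    def dissatisfaction(p):
        # Expected number of unseen changes, weighted by importance.
        return p["importance"] * p["change_rate"] * p["age"]
    ranked = sorted(pages, key=dissatisfaction, reverse=True)
    return [p["url"] for p in ranked[:budget]]

pages = [
    {"url": "news.xml",    "importance": 0.9, "change_rate": 24.0, "age": 0.5},
    {"url": "catalog.xml", "importance": 0.5, "change_rate": 1.0,  "age": 3.0},
    {"url": "archive.xml", "importance": 0.1, "change_rate": 0.01, "age": 30.0},
]
print(choose_refresh(pages, budget=2))  # ['news.xml', 'catalog.xml']
```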

From this, it should be clear that the quality of page acquisition depends on the quality of the guiding parameters. It is reasonably easy to estimate the change rate of a page based on how many times a change was detected in the last (say, 100) attempts to refresh it. On the other hand, it is more difficult to estimate the importance of a page. The number of accesses is difficult to evaluate, and may be meaningless if the page is often accessed uselessly. We prefer to use a page ranking in the style of Google [ibmrank, google]. The intuition is that an important page carries its importance to the pages it points to. This leads to computing the fixpoint of a function over the set of pages.

We run our page-ranking program on the graph of XML and HTML pages. This in itself is not a simple task, because of the hundreds of millions of HTML pages. But clearly, the importance of XML pages cannot be regarded in isolation. The program provides the importance of all pages of the Web and can be used (i) to rank XML pages in query results and (ii) to guide the refreshing of XML pages. Observe also that to perform this page ranking, we need the link structure2 of the Web. The maintenance of this data structure, while pages are constantly loaded, is also a very intensive task.
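
The fixpoint computation can be sketched in the style of PageRank [google]: a page spreads its importance over the pages it points to, and the importance vector is iterated to convergence. The damping factor and the tiny graph below are illustrative choices, not parameters taken from Xyleme.

```python
# Power-iteration sketch of the page-importance fixpoint.

def page_rank(links, damping=0.85, iters=50):
    """links: dict page -> list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Each page keeps a base share and receives importance from
        # every page that links to it.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                continue
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = page_rank(links)
print(max(rank, key=rank.get))  # "c": it collects importance from a and b
```

Running this at Web scale is a different matter: the rank vector and link structure no longer fit in memory, which is exactly the maintenance burden noted above.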

Note that we also have to refresh HTML pages, since the new version of an ``old'' HTML page may lead to the discovery of new XML pages. To decide when to refresh such HTML pages, the above ranking is not useful: for instance, a very important HTML page is useless for us if it does not lead to XML pages. We therefore use another definition of importance for HTML pages (importance for Xyleme), which leads to another fixpoint evaluation. This entire process is rather expensive, but it is only run periodically, and measurements on our first prototype show that it is feasible even for hundreds of millions of HTML pages.

6  Change Control

Users are often not only interested in the current values of documents but also in their evolution. They want to see changes as information that can be consulted or queried. Indeed, change detection and notification is becoming an important part of Web services, e.g., [mindit, doubletwist, ice].

Since Xyleme is a Web warehouse, it is impossible to have a real-time vision of the data. For instance, the time of a document version is the time when Xyleme decides to read or refresh the document. For each page, Xyleme thus gets snapshots. Given two consecutive versions of page D, at times ti-1 and ti, we can compute the modifications Di-1,i that occurred in between using a diff algorithm; we use the diff of Shasha and Wang [shasha-algo]. Such deltas can be used to monitor changes. They can also serve as the basis for a query-subscription and versioning mechanism. Observe, however, that several of the functionalities we will consider may be very space consuming; in particular, we will not be able to version all the XML pages of the Web.

We discuss next a number of services that Xyleme is addressing:

To represent versions, we opted for a change-centric approach, as opposed to the data-centric one that is more natural for database versioning. We use deltas in the style of [hull]. Deltas are often lossy and cannot be inverted. As is often done in physical logs, we use completed deltas, which contain additional information so that they can be reversed. In our version representation, all XML nodes are assigned a persistent identifier, which we call an XID (for Xyleme ID). The use of persistent identifiers is essential to describe changes and to query changes efficiently. For each versioned document, we store the current version, an XID-map (an original way to represent the IDs of this version) and the ``forward unit completed deltas'' (an XML document). An example is shown in Figure X. In the figure, the XIDs are shown in the tree, although in reality they are not stored inside the document but only in the XID-map.

An essential aspect of our management of versions is that, when a new version is installed, the new unit delta is appended to the existing delta for that page. This can be done without having to update data in place in the store. The price for it is that we store redundant information. A delta compression technique is used periodically to recover space.
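
A toy model conveys the mechanics of completed deltas. Here a document is flattened to a map from persistent XIDs to values, and a delta records both old and new values so it can be applied forward or undone; this flat model and its operation format are our simplifications, not Xyleme's storage format.

```python
# Minimal sketch of append-only versioning with completed deltas.

def diff(old, new):
    """Completed delta: list of (xid, old_value, new_value) changes.
    Storing old_value alongside new_value is what makes it reversible."""
    xids = set(old) | set(new)
    return [(x, old.get(x), new.get(x))
            for x in sorted(xids) if old.get(x) != new.get(x)]

class VersionedDoc:
    def __init__(self, doc):
        self.current = dict(doc)
        self.deltas = []          # append-only, never updated in place

    def install(self, new_doc):
        self.deltas.append(diff(self.current, new_doc))
        self.current = dict(new_doc)

    def version(self, i):
        """Reconstruct version i (0 = initial) by undoing later deltas."""
        doc = dict(self.current)
        for delta in reversed(self.deltas[i:]):
            for xid, old_val, _ in delta:
                if old_val is None:
                    doc.pop(xid, None)   # element did not exist yet
                else:
                    doc[xid] = old_val
        return doc

v = VersionedDoc({1: "<price>10</price>"})
v.install({1: "<price>12</price>", 2: "<promo/>"})
print(v.version(0))  # {1: '<price>10</price>'}
```

Installing a version only appends to the delta list, never rewriting stored data, which mirrors the append-without-update-in-place property above; the growing redundancy is what the periodic delta compression recovers.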

Figure 3: Storage of a versioned document

7  Semantic Data Integration

In this section, we address the issue of providing some form of semantic data integration in Xyleme. More precisely, for a given application domain, we want to offer users the means to access data without having to learn the exact DTD, or multiple DTDs, that structure data in this particular domain. In particular, a user may want to ask queries that extract and combine data from XML documents with different DTDs, without the possibly heavy task of expressing such queries over several DTDs directly. We would like queries to be expressed at an abstract, semantic level, and the system to automatically reformulate them into actual queries in the Xyleme query language.

Achieving this goal requires (i) some extra knowledge about the semantics of the particular DTDs, (ii) some processing to analyze that knowledge, and (iii) the compilation of such abstract queries into actual Xyleme queries.
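
Step (iii) can be pictured with a hypothetical example: an abstract path is rewritten into one concrete path per DTD via a mapping table. The abstract and concrete paths, the DTD names, and the mapping itself are all invented for the illustration; the actual Xyleme query language is not shown.

```python
# Hypothetical abstract-to-concrete query rewriting via a mapping table.
# Each known DTD maps abstract paths to its own concrete element paths.

MAPPINGS = {
    "museum_dtd_1": {"Artwork/Title": "oeuvre/titre"},
    "museum_dtd_2": {"Artwork/Title": "piece/name"},
}

def rewrite(abstract_path):
    """Return, per DTD, the concrete path answering the abstract one."""
    return {dtd: m[abstract_path]
            for dtd, m in MAPPINGS.items() if abstract_path in m}

print(rewrite("Artwork/Title"))
# one concrete query per DTD, all answering the same abstract question
```

The point of the indirection is that the user writes one query against the abstract DTD, while the system fans it out to every concrete DTD in the cluster and merges the answers.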

An ideal solution for us would be for this extra knowledge to be directly provided by the designers of the particular DTDs, or by some domain expert, as meta-data in languages such as RDF or XML-Schema. Such approaches will be essential in the long run, and we are studying their impact. For now, however, we should not expect to find the knowledge we need in this manner, since the standards are not yet agreed upon. We therefore consider how to obtain such knowledge automatically, or how it can be provided by an expert.

Such knowledge can be automatically acquired using available ontologies or thesauri. Several general and specialized ontologies exist. WordNet, for instance, is a general linguistic ontology providing a hierarchical organization of synonym sets, each representing one underlying lexical concept. The hierarchical categories of topics provided by the search engine Yahoo! can also be seen as such an ontology. The problem is then to plug in or match existing ontologies or thesauri with the DTDs known to Xyleme. The problem varies depending on whether the Xyleme data we consider are pulled or pushed. We can expect people who push data to the Xyleme server to characterize their data relative to the list of given topics that constitute the Xyleme ontology. For pulled data, the problem is different, because we have to find a way to connect the structure of the DTD with our given set of topics.

Our first work in this direction is the design and implementation of scalable machine-learning techniques to automatically classify DTDs into clusters, according to similarities between subsets of DTDs and/or XML documents. The expected volume (thousands of DTDs, and soon more) makes this quite challenging. Once a cluster of actual DTDs relating to a single domain has been isolated, the next task is to automatically generate an abstract DTD covering both the common aspects of these DTDs and their particularities. Such processing is done off-line and leads to the incremental building or enrichment of a set of forms constituting semantic views over the Xyleme repository. The connection between this semantic layer and the data stored in Xyleme relies on the automatic translation from the language of forms to the Xyleme query language.
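
A deliberately small stand-in for the classification step: DTDs are represented by their sets of element names and grouped greedily by Jaccard similarity. The similarity measure, threshold, and greedy single-pass strategy are our choices for illustration; the scalable techniques mentioned above are not reproduced here.

```python
# Toy DTD clustering by element-name overlap.

def jaccard(a, b):
    """Similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster(dtds, threshold=0.3):
    """dtds: dict DTD name -> set of element names.
    Greedy single pass: join the first cluster similar enough,
    otherwise start a new one."""
    clusters = []  # list of (member names, union-of-tags) pairs
    for name, tags in dtds.items():
        for members, rep in clusters:
            if jaccard(tags, rep) >= threshold:
                members.append(name)
                rep |= tags        # grow the cluster representative
                break
        else:
            clusters.append(([name], set(tags)))
    return [members for members, _ in clusters]

dtds = {
    "estate1": {"house", "price", "surface"},
    "estate2": {"house", "price", "rooms"},
    "chem":    {"molecule", "atom", "bond"},
}
print(cluster(dtds))  # [['estate1', 'estate2'], ['chem']]
```

In this sketch, each cluster's union of element names is a crude stand-in for the abstract DTD that would be generated from the cluster.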

We are currently investigating and implementing these two major tasks (semantic classification and integration). We are also experimenting with more ``manual'' approaches: humans are specifying the mappings between concrete and abstract DTDs in two application domains, culture (museums) and travel. Clearly, the automatic and manual approaches could be used in alternation to build rich semantic domains.

In this setting, the technology we use consists of knowledge representation techniques that offer declarative formalisms for expressing semantic and flexible views over data, together with (i) inference algorithms that provide services for automatic classification and (ii) query rewriting using views.

8  Conclusion

As mentioned earlier, Xyleme is a project distributed among several groups, mostly in France but also in Germany. The design of Version 1 of the system (planned for the end of 2000) is almost complete. The repository, the crawler and the page ranker are running; the rest, code-wise, varies from mostly done to still in limbo. We expect to discover new problems along the way and believe that the experience will remain as challenging and exciting as it has been so far.

The number of accessible XML pages is marginal for the moment, less than 1% of the number of HTML pages. However, we believe that many companies have already switched to XML and that the number of XML pages will explode as soon as XML is supported by most installed browsers (within a year?).
For instance, with 200 million pages, 1.5 billion links were found in [graphstructure]. This gives some idea of the size of the data structure that has to be maintained and the difficulty of the task.
