A dynamic warehouse for the XML data of the Web
(preliminary report)
Xyleme
Abstract: Xyleme is a dynamic warehouse for the XML
data of the Web supporting change control and data integration. We briefly
present our motivations, the general architecture, the main aspects of
Xyleme and the status of the work.
Xyleme: a complex tissue of wood cells, functions in conduction and
storage ...
1 Introduction and motivation
The current development of the Web and the generalization of XML technology
provide a major opportunity that could radically change the face of the
Web. We are working on a dynamic XML warehouse for the Web, namely
Xyleme. In the present paper, we briefly present our motivations, Xyleme's
general architecture and main aspects, as well as the status of the work.
The Web is huge and keeps growing at a healthy pace. Most of the data
is unstructured, consisting of text (essentially HTML) and images. Some
is structured, mostly stored in relational databases. All this data constitutes
the largest body of information accessible to any individual in the history
of humanity. The lack of structure and the distribution of data seriously
limit access to the information. To provide functionalities beyond full-text
searching à la AltaVista or Yahoo!, specific applications are being built
that require heavy investments. A major evolution is occurring that will
dramatically simplify the task of developing such applications.
XML is coming!
XML [xml-w3c] is a standard to exchange structured
data over the Web. It is being widely adopted. It is believed that progressively
more and more Web data will be in XML form and that DTDs will be available
to describe it (see [biztalk, oasis]).
Communities (scientific, business, others) will define their own DTD to
provide for a uniform representation of data in specific areas. Many already
have, in areas as diverse as real estate and chemistry. Although a large portion
of the Web will remain unstructured or hidden behind interactive graphics
and forms, a very large and content-wise essential portion of the Web will
soon be open and available in XML.
Given this, we propose to study and build a dynamic World Wide XML
warehouse. Indeed, we plan to design a data warehouse capable of storing
all the XML data available on the planet. XML is still in its infancy, so this
is not yet a very big challenge (see Note 1), but we believe it soon will be.
So, a major issue is scalability.
The problems we address are typical warehousing problems such as change
control or data integration. In the Web context, they acquire a truly distinct
flavor. More precisely, the research directions are as follows:
- storage: we need to efficiently store and retrieve huge quantities of XML
data (hundreds of millions of pages) at a granularity finer than the whole
document. For this, we use the Natix store, which is being developed at the
University of Mannheim and is tailored to manage tree data.
- query processing: we plan to use the official query language for XML when
it becomes available. Much of our query-processing effort goes into developing
the appropriate indexing mechanisms. A major issue is the size of the indexes.
- data acquisition: data will typically be obtained by pull (e.g., discovered
by crawling the Web) and by push (e.g., publication by Web servers). We are
working on techniques to keep the Xyleme repository as up-to-date with the Web
as possible. For that, the strategy used to refresh pages is essential.
- change control: data on the Web changes rapidly and users are often
interested in these changes. We are working on services such as change
monitoring and query subscription.
- semantic data integration: we want to free users from having to deal with
specific DTDs when expressing queries, since there will typically be many DTDs
available for the specific domain of interest. In particular, we are
investigating automatic data integration based on thesauri of terms and on
DTD clustering techniques.
- architecture: only parallelism can handle such a volume of data (terabytes)
and such a workload. We are distributing the data over a local network of
Linux PCs, which raises a number of system issues.
- data cleaning and duplicate elimination are important issues that we do not
address for the moment.
Clearly, these goals are very challenging technically. However, one may
question the usefulness of the approach. Certain functionalities of Xyleme
are standard and are, or will soon be, found in many systems. For instance,
from a query viewpoint, search engines will certainly take advantage of
XML structure to provide (like Xyleme) attribute-equal-value search and
to support query languages for XML when they become available. As
another example, from a monitoring viewpoint, some systems such as MindIt
already offer some form of monitoring of the Web. What more can be brought
by an all-Web warehousing approach?
This will be best illustrated by a couple of examples. First, consider
query processing. Certain queries requiring joins are simply not conceivable
in a Web context for performance reasons. They become feasible with a warehousing
approach. Another possible model would require that each application warehouse
the particular data it needs. We believe that a world-wide warehouse such
as Xyleme is often a cost-effective alternative. Consider now versioning.
A user interested in a specific catalog may load the catalog at time
t. Coming back one week later, the user may want to refresh this
version of the catalog and also to know which portions changed. Xyleme will
support such services, which go beyond simple monitoring of document changes
and require storing versions of data and query answers. More examples of
new services supported by Xyleme will be encountered in the paper.
So, Xyleme will warehouse all the XML data of the Web, host application
data (e.g., answers to queries) and, in a second stage, host the applications
themselves.
The Xyleme project is functioning as an open, loosely coupled network
of researchers. The Verso Group at INRIA was at the origin of the project
together with F. Bancilhon, formerly CEO of O2
Technology. The database groups of the University of Mannheim and CNAM-Paris,
as well as the IASI team of the University of Paris-Orsay, rapidly joined,
along with a number of individual researchers from various places.
The paper is organized as follows. Section 2 introduces the architecture.
We discuss the repository in Section 3. Section 4 deals with query processing
and indexing. Data acquisition is considered in Section 5 and change control
in Section 6. The topic of Section 7 is data integration. Section 8 is a
conclusion.
2 Architecture
In this section, we discuss the architecture
of the system.
The system is functionally organized as follows:
- physical level: a repository tailored for tree data (Natix) and an index
manager;
- logical level: the query processor and the acquisition of data;
- application level: change management and semantic data integration;
- interface level: the Web interface for accessing the Web and the Xyleme
interface with clients;
- client module.
Figure 1: Architecture
The functional architecture is pictured in Figure 1.
All modules are written in C++, except for the client, which is mostly expected
to run in standard Web browsers with Java applets. The hardware architecture
consists of PCs running Linux. As shown in Figure 1,
we (logically) distinguish between several kinds of machines:
- repository machines: each is responsible for a large collection of XML
documents. (Each DTD is assigned to a given repository, so all documents with
the same DTD reside on the same machine.)
- index machines: they handle the indexes for a large portion of the data.
This is a very resource-demanding task, so these machines have lots of memory.
Their number should be expected to grow linearly with the number of pages
known by Xyleme.
- Web-interface machine(s): in charge of crawling and accessing data on the
Web. For the moment, there is a single such machine.
- acquisition machines: they are in charge of controlling the (re)loading of
Web pages. This is a rather modest task, except for some data-intensive
programs that are run periodically and require spreading the work over
several machines.
- application machines: each is responsible for the management of some
queries, change management and semantic integration. Their number is expected
to grow roughly linearly with the number of users.
3 Repository
In this section, we discuss the repository.
Xyleme requires an efficient, updatable store for XML data. Natix, developed
at the University of Mannheim, is an excellent match.
Typically, two storage approaches have been used for such data: (i) store XML
pages as byte streams or (ii) store the data/DOM tree in a conventional DBMS.
The first has the disadvantage that it privileges a certain granularity of
data access and requires parsing to obtain the structure. The second
artificially creates lots of objects even for medium-size documents, and this,
together with the additional layers of a DBMS, tends to slow down processing.
Natix uses a hybrid approach. Data are stored as trees down to a certain
depth, below which byte streams are used. Furthermore, some minimal structure
is kept in the "flat" part to facilitate access with flexible levels of
granularity. Also, Natix offers very flexible means of choosing the border
between the object and flat worlds.
Natix is built, in a standard manner, on top of a record manager with
fixed-size pages and segments (sequences of pages). Standard functionalities
for record management (e.g., buffering) and fast access (e.g., indexes)
are provided by Natix. The performance of Xyleme will heavily depend on
the efficiency of the repository. The use of a relational system would
result in very bad clustering and some waste of space for every single
element. An object system would handle clustering better but would also
be very space consuming. We believe Natix will provide better performance
because it takes advantage of the particularities of the data. The first
available measurements seem to support this belief, but more extensive
tests are clearly needed.
Perhaps the most interesting aspect of the system is the splitting of
XML trees into pages. An example is shown in Figure 2.
Observe the use of "proxy" nodes (p-nodes) and "aggregate" nodes (h-nodes).
One can control the behavior of the split by specifying, at the DTD level,
which splits are forbidden, which are forced, and which are left to the
discretion of the system.
Figure 2: Distributing nodes on records
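To make the splitting idea concrete, here is a minimal sketch in C++, not the actual Natix algorithm: subtrees are flattened into the current record until a size budget would be exceeded, at which point the offending child subtree is moved to a fresh record and replaced by a proxy entry. The record layout, the node-count size estimate and the budget are illustrative assumptions.

    #include <memory>
    #include <string>
    #include <vector>

    // A toy XML element node: a tag and its children (text content omitted).
    struct Node {
        std::string tag;
        std::vector<std::unique_ptr<Node>> children;
    };

    // A record stands in for one entry of a fixed-size page in the store.
    struct Record {
        std::vector<std::string> entries;   // flattened tags and proxy references
    };

    struct Store {
        std::vector<Record> records;
        size_t fresh() { records.push_back({}); return records.size() - 1; }
    };

    // Illustrative size estimate: number of nodes in the subtree.
    static int subtreeSize(const Node& n) {
        int s = 1;
        for (const auto& c : n.children) s += subtreeSize(*c);
        return s;
    }

    // Place a subtree into record `rec`. A child whose subtree exceeds `budget`
    // is split off into a fresh record and replaced by a proxy entry (p-node).
    static void place(const Node& n, size_t rec, Store& store, int budget) {
        store.records[rec].entries.push_back(n.tag);
        for (const auto& c : n.children) {
            if (subtreeSize(*c) > budget) {
                size_t child = store.fresh();
                store.records[rec].entries.push_back("proxy->" + std::to_string(child));
                place(*c, child, store, budget);
            } else {
                place(*c, rec, store, budget);
            }
        }
    }

In Natix itself, aggregate (h-)nodes additionally group the children kept in a record, and the DTD-level split specification described above can override a purely size-driven decision like this one.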
4 Query Processing
Queries live at many levels in the system:
queries against the "abstract DTDs" (see Section 7),
change queries (see Section 6), and actual Xyleme
queries issued directly. Ultimately, all of them result in a complex query
being managed by the system. This is the topic of the present section.
Query processing is certainly not a new topic; as a matter of fact, the
Xyleme query processor relies mainly on relational and object-oriented
techniques. Still, the particularities of XML and the huge size of the
warehouse we are considering raise a number of interesting problems.
- Pattern Matching: XQL [RoLaSc1998], like most XML query languages, provides
pattern matching facilities that are needed, e.g., to express the following
query: find the documents in which there is a product somewhere below a
company whose name resembles "Xyleme".
Object and relational algebras do not support this functionality, although
it could be added using some limited form of recursion (transitive closure).
Still, we believe it is important to provide a pattern matching operator
that can eventually be rewritten into more standard ones, but that can
also have its own special implementation (see below). Thus, in a manner
similar to [ChClSi2000, ChClMo1996], the Xyleme optimizer uses an object
algebra to which we added a new operator called PatternScan.
- Full-Text Indexes: Traditionally, full-text indexes return the list of
documents in which a word occurs. We are considering answering more complex
queries and, notably, we want to use them to implement the PatternScan
operator (see above). In order to do that efficiently, we extend these
indexes by annotating each occurrence of a word with a code that allows it
to be located in the document relative to other words. To guarantee
good performance, the word encoding must have several properties.
(i) It must be dense. Experiments we ran showed that adding two integers
to each word of a full-text index multiplies its size by a factor of 1.5.
Since we need to keep the Xyleme indexes in memory to provide performance
similar to that of current search engines, we must be very careful about
the way we encode words if we do not want to manage twice as many machines
as needed.
(ii) It should be updatable. Updating a full-text index is an expensive
operation that has to be performed each time a document is loaded into the
warehouse. One way to minimize its cost is to compute a diff between
the old and new versions of the document and to add/remove only the elements
that have changed. However, most encoding schemes we found do not support
updates well; notably, the insertion of one element can invalidate the codes
of all the others.
(iii) It should help us identify elements in the store. If there is
no connection between index encoding and the way documents are stored on
disk, the index can only be used to filter documents, not elements. Thus,
we cannot exploit the different levels of granularity supported by Natix.
So far, we have not found an encoding scheme with all three properties.
We are planning to use two codes. One relies on structural information
to avoid duplicating the internal nodes of a document. Although not dense
in general, we believe it to be a good candidate for, e.g., XML data
generated from databases. The other encoding is borrowed from the information
retrieval community and encodes intervals: one word is below another if
its interval is included in that of the other word (a small sketch of this
interval test is given at the end of this section).
- Incomplete and Approximate Answers: One intended use of Xyleme is as a
sophisticated search engine. People searching the Web are usually very
impatient, do not care for large answers and certainly do not accept
"no" as a satisfactory answer. To take these facts into account, the
Xyleme query processor uses lower and upper bounds: processing cannot
stop until we have found a minimum number of answers and should stop as
soon as we have found some maximum number. The lower bound introduces
approximation: if we do not find an exact answer, we somehow have to relax
the query. This is done in several steps, each removing some structure from
the query; the final step amounts to a standard keyword search. The upper
bound introduces incompleteness. Obviously, data coming from important
documents (see Section 5) should be returned first. This raises
many non-trivial problems when queries are distributed or involve joins.
Currently, we are considering picking answers according to processing time
rather than importance.
- Queries against Views: Although a query plan usually ends up being
distributed over several machines, it is generated on a single machine
so as to avoid (i) duplicating optimization efforts and (ii) unnecessary
communications. To achieve reasonable performance, all the information
needed by this process must be available on the machine it is running on,
preferably in RAM. This information includes views. As will be explained
in Section 7, views are used in Xyleme to provide a simple and clever
interface to a huge number of documents described by heterogeneous DTDs.
Accordingly, a view definition can be very large, certainly not fitting in
the RAM of a single machine. Still, some of this information is needed if
we want to address the appropriate machines with the appropriate plans.
Again, we use encoding schemes to represent paths in documents. Although
the three properties listed above still apply, updatability is even more
critical here.
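To illustrate the interval encoding mentioned above, here is a minimal sketch, not the Xyleme implementation: each occurrence carries a [start, end] pair taken from a preorder numbering of its document, containment reduces to interval nesting, and the pattern "a product somewhere below a company" becomes a join of two posting lists on the document identifier. The posting-list layout and the numbers are illustrative assumptions.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Interval code: each element or word occurrence carries a [start, end] pair
    // taken from a preorder numbering of its document; an occurrence is below
    // another one iff its interval is strictly nested inside the other's.
    struct Interval { int start, end; };

    static bool isBelow(const Interval& a, const Interval& b) {
        return b.start < a.start && a.end < b.end;
    }

    struct Posting { int doc; Interval iv; };   // one occurrence in one document

    int main() {
        // A toy posting list: word -> occurrences (document id plus interval).
        std::map<std::string, std::vector<Posting>> index;
        index["company"].push_back({1, {1, 10}});
        index["company"].push_back({2, {1, 8}});
        index["product"].push_back({1, {4, 6}});     // nested in company [1,10]
        index["product"].push_back({2, {9, 12}});    // not nested in company [1,8]

        // Pattern "a product somewhere below a company": join the two posting
        // lists on the document id and keep the pairs whose intervals nest.
        for (const Posting& p : index["product"])
            for (const Posting& c : index["company"])
                if (p.doc == c.doc && isBelow(p.iv, c.iv))
                    std::printf("match in document %d\n", p.doc);
        return 0;
    }

It is this property that lets the full-text index filter at the element level rather than only at the document level, at the price of the update difficulties discussed in point (ii) above.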
5 Data Acquisition
In this section, we consider the issue of data acquisition. In a first
stage, we crawl the Web in search of XML data. This is a standard task.
We implemented our own crawler, which runs at the honest speed of 100,000
pages per hour. Note that most pages, e.g., GIF images or even HTML pages,
are not stored or indexed by Xyleme. Like every crawler we know of, ours is
parameterized by some depth k: when discovering a new HTML page in a site,
the crawler should not go deeper than k within that site. This tends to
favor breadth over depth to some extent. To crawl fast, we need, given a
URL, to decide very quickly whether it is new to Xyleme or not. For that,
we use URL "signatures" and a large hash table, at the cost of (statistically
rarely) concluding that we already know a page when we do not.
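The signature test can be sketched as follows; this is an illustration of the idea rather than the actual crawler code, and the table layout and hash function are assumptions. Each URL is reduced to a fixed-size signature kept in a large in-memory table, so the membership test costs a few probes, and two distinct URLs colliding on the same signature produce exactly the rare "already known" error mentioned above.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // One 64-bit signature per known URL, kept in a large open-addressed table
    // (assumed to be sized well above the number of URLs we expect to see).
    // Two distinct URLs hashing to the same signature make the crawler wrongly
    // conclude that it already knows the page -- the rare error accepted above.
    class UrlSeenTable {
    public:
        explicit UrlSeenTable(std::size_t slots) : table_(slots, 0) {}

        // Returns true if the URL had not been seen before (and records it).
        bool markNew(const std::string& url) {
            std::uint64_t sig = std::hash<std::string>{}(url);
            if (sig == 0) sig = 1;                   // 0 is reserved for "empty slot"
            std::size_t i = sig % table_.size();
            while (table_[i] != 0) {
                if (table_[i] == sig) return false;  // (probably) already known
                i = (i + 1) % table_.size();         // linear probing
            }
            table_[i] = sig;
            return true;
        }

    private:
        std::vector<std::uint64_t> table_;
    };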
A critical issue is the strategy to decide when to refresh a document.
Xyleme administrators should be able to specify some general acquisition
policies, e.g., whether it is more urgent to explore new branches of the
Web or to refresh the documents we already have. The decision to refresh
certain documents is based on criteria such as:
- Publication data: in push mode, the owner of the document may let Xyleme
know when a change occurs or may specify a refresh frequency.
- Temporal information about the document, such as the time it was last read
or its change rate. A document that has not changed for six months is likely
to be an archive and its refresh frequency should be very low. At the other
extreme, a document that changes all the time may contain real-time data,
such as stock exchange quotes, that Xyleme should not attempt to keep pace
with.
- Factual information, such as the document size or its importance as
determined by Xyleme.
Periodically (e.g., a couple of times a day), Xyleme determines the next
pages to refresh. The strategy is based on dynamic optimization. More
precisely, for each time period (say, half a day), we try to minimize a
particular cost function under a constraint. The cost is defined as the
dissatisfaction of users being presented stale data; the constraint is the
average number of pages that Xyleme can read in that time period. The cost
function is designed so that important and rapidly changing pages are read
more often. Also, eventually all pages get read (no eternal starvation),
unless the administrator explicitly decides to leave a portion of the Web
out, e.g., pages below a certain importance threshold.
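The following sketch illustrates one possible cost-driven selection under a per-period page budget. The actual Xyleme cost function is not spelled out here, so the weighting (importance times estimated change rate times staleness, plus a small term that prevents starvation) is purely an illustrative assumption.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Per-page refresh bookkeeping (all fields are illustrative).
    struct Page {
        int id;
        double importance;    // e.g., the link-based importance discussed below
        double changeRate;    // fraction of recent reads that detected a change
        double periodsStale;  // periods elapsed since the page was last read
    };

    // Staleness cost of leaving a page unread for one more period: important,
    // fast-changing, long-unread pages cost the most.  The small additive term
    // grows with staleness so that every page is eventually read (no starvation).
    static double cost(const Page& p) {
        return p.importance * p.changeRate * p.periodsStale + 0.001 * p.periodsStale;
    }

    // Choose the pages to read this period: the ones whose staleness costs the
    // most, subject to the constraint on the number of pages read per period.
    std::vector<int> planRefresh(std::vector<Page> pages, std::size_t budget) {
        std::sort(pages.begin(), pages.end(),
                  [](const Page& a, const Page& b) { return cost(a) > cost(b); });
        std::vector<int> chosen;
        for (std::size_t i = 0; i < pages.size() && i < budget; ++i)
            chosen.push_back(pages[i].id);
        return chosen;
    }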
From this, it should be clear that the quality of page acquisition depends
on the quality of the guiding parameters. It is reasonably easy to estimate
the change rate of a page based on how many times a change was detected
in the last (say, 100) attempts to refresh the page. On the other hand,
it is more difficult to estimate the importance of a page. The number of
accesses is difficult to evaluate and may be meaningless if the page is
often accessed uselessly. We prefer to use a page ranking in the style
of Google [ibmrank, google]. The intuition is that an important page carries
its importance over to the pages it points to. This leads to computing the
fixpoint of a function over the set of pages.
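The fixpoint computation can be sketched as a plain power iteration over the link graph, as below; the damping constant, the handling of dangling pages and the convergence test are illustrative assumptions, not the parameters actually used by Xyleme.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // links[i] lists the pages that page i points to.  Importance flows from a
    // page to the pages it points to; we iterate until the values stop moving,
    // i.e., until a fixpoint is (approximately) reached.
    std::vector<double> importance(const std::vector<std::vector<int>>& links,
                                   double damping = 0.85, double eps = 1e-8) {
        const std::size_t n = links.size();
        if (n == 0) return {};
        std::vector<double> rank(n, 1.0 / n), next(n);
        for (int iter = 0; iter < 100; ++iter) {
            std::fill(next.begin(), next.end(), (1.0 - damping) / n);
            for (std::size_t i = 0; i < n; ++i) {
                if (links[i].empty()) continue;            // dangling pages ignored here
                double share = damping * rank[i] / links[i].size();
                for (int j : links[i]) next[j] += share;   // carry importance forward
            }
            double delta = 0.0;
            for (std::size_t i = 0; i < n; ++i) delta += std::fabs(next[i] - rank[i]);
            rank = next;
            if (delta < eps) break;                        // fixpoint reached
        }
        return rank;
    }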
We run our page-ranking program on the graph of XML and HTML pages. This in
itself is not a simple task because of the hundreds of millions of HTML pages
involved. But clearly, the importance of XML pages cannot be assessed in
isolation. The program provides the importance of all pages of the Web and
can be used (i) to rank XML pages in query results and (ii) to guide the
refreshing of XML pages. Observe also that to perform this page ranking,
we need the link structure of the Web (see Note 2). Maintaining this data
structure while pages are constantly being loaded is also a very intensive
task.
Note that we also have to refresh HTML pages, since a new version of an
"old" HTML page may lead to the discovery of new XML pages. To decide when
to refresh such HTML pages, the above ranking is not useful: a very important
HTML page is useless for us if it does not lead to XML pages. We therefore
use another definition of importance for HTML pages (importance for Xyleme),
which leads to another fixpoint evaluation. This entire process is rather
expensive, but it is only run periodically, and measurements on our first
prototype show that it is feasible even when we consider hundreds of millions
of HTML pages.
6 Change Control
Users are often interested not only in the current values of documents but
also in their evolution. They want to see changes as information that can be
consulted or queried. Indeed, change detection and notification are becoming
an important part of Web services, e.g., [mindit, doubletwist, ice].
Since Xyleme is a Web warehouse, it is impossible to have a real-time view
of the data. For instance, the time of a document version is the time when
Xyleme decides to read or refresh the document. Xyleme thus only gets
snapshots of each page. Given two consecutive versions of a page D at times
t_{i-1} and t_i, we can compute the delta D_{i-1,i} of the modifications that
occurred in between, using a diff algorithm; we use the diff of Shasha and
Wang [shasha-algo]. Such deltas can be used to support the monitoring of
changes. They can also serve as the basis for a query subscription and
versioning mechanism. Observe, however, that several of the functionalities
we consider may be very space consuming; in particular, we will not be able
to version all the XML pages of the Web.
We discuss next a number of services that Xyleme is addressing:
- Client store refreshing: A Xyleme client may request some documents,
collections of documents, or query results, say V_i at time i, and store
this data on its local computer. The client may then request D_{i,now} to
refresh its store or to see what changed. This is in the spirit of
Information and Content Exchange, ICE [ice, oasis-ice], a standard for
electronic commerce.
- Monitoring changes: This is the first facet of the query subscription
system we are implementing. At the time we compute a delta, we verify for
each atomic change (by simple lookups in an index) whether this change is
monitored by some subscription; a small sketch of such a lookup is given
after this list. Examples of changes that can be monitored include any
modification of a document or the insertion of a new item in a catalog.
ICE also provides a protocol for notifications. Services such as MindIt
[mindit] allow changes to be monitored at the page level; we plan to support
monitoring at the node level as well.
- Querying changes: Deltas are XML documents and thus may be queried like
any other documents. For instance, one will be able to ask for the list of
all items introduced in a catalog since January, or for the movies that
opened in Paris this week. Queries against changes form the second facet of
the query subscription system. They get their full power as continuous
queries [Niagara], i.e., queries asked regularly of the warehouse. We believe
that continuous queries over changes are often what users are really
interested in, e.g., notify me once a week of the top 20 new documents
about XML.
- Versioning: We will be able to obtain an old version of a document or of
part of one. Although less important, this may sometimes be useful. (Note
that a versioning mechanism is useful beyond this, e.g., for client store
refreshing.)
- Temporal Queries: In some sense, this can be viewed as another way to query
changes. The queries are addressed to a (virtual) logical representation of
the data history using a temporal query language.
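The subscription lookup mentioned under "Monitoring changes" above can be sketched as follows; keying subscriptions by a (document, path) pair and the representation of atomic changes are illustrative assumptions, not the actual Xyleme data structures.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // An atomic change extracted from a delta (illustrative representation).
    struct AtomicChange {
        enum Kind { Insert, Delete, Update } kind;
        std::string doc;    // document identifier
        std::string path;   // path (or node) the change applies to
    };

    // Subscriptions are indexed by the (document, path) they monitor, so that
    // each atomic change of a freshly computed delta costs one index lookup.
    class SubscriptionIndex {
    public:
        void subscribe(int subscriber, const std::string& doc, const std::string& path) {
            index_[{doc, path}].push_back(subscriber);
        }

        // Subscribers to notify for one atomic change of a delta.
        std::vector<int> match(const AtomicChange& c) const {
            auto it = index_.find({c.doc, c.path});
            if (it == index_.end()) return {};
            return it->second;
        }

    private:
        std::map<std::pair<std::string, std::string>, std::vector<int>> index_;
    };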
To represent versions, we opted for a change-centric approach, as opposed
to the data-centric one that is more natural for database versioning.
We use deltas in the style of [hull]. Deltas are often lossy and cannot be
inverted. As is often done in physical logs, we use completed deltas, which
contain the additional information needed to reverse them. In our version
representation, all XML nodes are assigned a persistent identifier that we
call an XID, for Xyleme ID. The use of persistent identifiers is essential
to describe changes and also to query changes efficiently. For each versioned
document, we store the current version, an XID-map (an original way to
represent the IDs of this version) and the "forward unit completed deltas"
(an XML document). An example is shown in Figure 3. In the figure, the XIDs
are shown in the tree, although in reality they are not stored inside the
document but only in the XID-map.
An essential aspect of our version management is that, when a new version is
installed, the new unit delta is appended to the existing delta for that
page. This can be done without having to update data in place in the store.
The price is that we store redundant information; a delta compression
technique is run periodically to recover space.
Figure 3: Storage of a versioned document
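To illustrate the notions of XIDs and completed deltas, here is a minimal sketch of one possible in-memory representation. The field names and string-valued nodes are illustrative assumptions; in Xyleme the deltas are themselves XML documents, as explained above.

    #include <string>
    #include <utility>
    #include <vector>

    using XID = long;   // persistent node identifier, never reused across versions

    // One operation of a *completed* delta: both the old and the new value of the
    // touched node are recorded, so the delta can be applied forward or inverted.
    struct DeltaOp {
        enum Kind { Insert, Delete, Update } kind;
        XID node;               // persistent identifier of the node concerned
        std::string oldValue;   // empty for Insert
        std::string newValue;   // empty for Delete
    };

    // A versioned document: the current version, its XID-map, and the chain of
    // unit deltas appended (never updated in place) each time a snapshot arrives.
    struct VersionedDocument {
        std::string currentVersion;                    // current XML text
        std::vector<XID> xidMap;                       // XIDs of the current nodes
        std::vector<std::vector<DeltaOp>> unitDeltas;  // one completed delta per snapshot
    };

    // Invert one completed unit delta, e.g., to step back one version.
    std::vector<DeltaOp> invert(const std::vector<DeltaOp>& delta) {
        std::vector<DeltaOp> reversed;
        for (auto it = delta.rbegin(); it != delta.rend(); ++it) {
            DeltaOp op = *it;
            if (op.kind == DeltaOp::Insert)      op.kind = DeltaOp::Delete;
            else if (op.kind == DeltaOp::Delete) op.kind = DeltaOp::Insert;
            std::swap(op.oldValue, op.newValue);
            reversed.push_back(op);
        }
        return reversed;
    }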
7 Semantic Data Integration
In this section, we address the issue of providing some form of semantic
data integration in Xyleme. More precisely, for a given application domain,
we want to offer users the means to access data without having to learn the
exact DTD or the multiple DTDs that structure the data in this particular
domain. In particular, a user may want to ask queries that extract and
combine data from XML documents with different DTDs without the possibly
heavy task of expressing such queries over several DTDs directly. We would
like queries to be expressed at an abstract, semantic level, and the system
to automatically reformulate them into actual queries in the Xyleme query
language.
Achieving this goal requires (i) some extra knowledge of the semantics of
the particular DTDs, (ii) some processing to analyze that knowledge, and
(iii) the compilation of such abstract queries into actual Xyleme queries.
An optimal solution for us would be for this extra knowledge to be directly
provided by the designer of the particular DTDs, or by some domain expert,
as metadata in languages such as RDF or XML Schema. Such approaches will be
essential in the long run and we are studying their impact. But for now,
we should not expect to find the knowledge we need in this manner, since
the standards are not yet agreed upon. We therefore consider how such
knowledge can be obtained automatically or provided by an expert.
Such knowledge can be acquired automatically using available ontologies or
thesauri. Several general or specialized ontologies exist. WordNet, for
instance, is a general linguistic ontology providing a hierarchical
organization of synonym sets, each representing one underlying lexical
concept. The hierarchical categories of topics provided by the search engine
Yahoo! can also be seen as such an ontology. The problem is then to plug in
or match existing ontologies or thesauri with the DTDs known by Xyleme.
The problem varies depending on whether the data we consider is pulled or
pushed. We can expect the people who push data to the Xyleme server to
characterize their data relative to the list of given topics that constitutes
the Xyleme ontology. For pulled data, the problem is different because we
have to find a way to connect the structure of the DTD with our given set
of topics.
Our first work in this direction is the design and implementation of scalable
machine learning techniques to automatically classify DTDs into clusters,
according to similarities between subsets of DTDs and/or XML documents.
The expected volume (thousands of DTDs, and soon more) makes this quite
challenging. Once a cluster of actual DTDs relating to a single domain has
been isolated, the next task is to automatically generate an abstract DTD
covering both the common aspects of these DTDs and their particularities.
Such processing is done off-line and leads to the incremental building or
enrichment of a set of forms constituting semantic views over the Xyleme
repository. The connection between this semantic layer and the data stored
in Xyleme relies on the automatic translation from the language of forms to
the Xyleme query language.
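As a rough illustration of the clustering step only, and certainly not of the machine learning techniques actually used, one could represent each DTD by the set of element names it declares and group DTDs greedily by set similarity; the Jaccard measure and the threshold below are illustrative assumptions.

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    // Represent a DTD by the set of element names it declares (a crude proxy for
    // its structure) and cluster DTDs whose element sets overlap strongly.
    using Dtd = std::set<std::string>;

    static double jaccard(const Dtd& a, const Dtd& b) {
        std::size_t common = 0;
        for (const auto& tag : a) common += b.count(tag);
        std::size_t total = a.size() + b.size() - common;
        return total == 0 ? 0.0 : static_cast<double>(common) / static_cast<double>(total);
    }

    // Greedy clustering: each DTD joins the first cluster whose representative
    // (its first member) is similar enough, otherwise it starts a new cluster.
    std::vector<std::vector<int>> clusterDtds(const std::vector<Dtd>& dtds,
                                              double threshold = 0.5) {
        std::vector<std::vector<int>> clusters;
        for (int i = 0; i < static_cast<int>(dtds.size()); ++i) {
            bool placed = false;
            for (auto& cluster : clusters) {
                if (jaccard(dtds[i], dtds[cluster.front()]) >= threshold) {
                    cluster.push_back(i);
                    placed = true;
                    break;
                }
            }
            if (!placed) clusters.push_back({i});
        }
        return clusters;
    }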
We are currently investigating and implementing these two major tasks
(semantic classification and integration). We are also experimenting with
more "manual" approaches: humans are specifying the mappings between concrete
and abstract DTDs in two application domains, culture (museums) and travel.
Clearly, the automatic and manual approaches could be used together to build
rich semantic domains.
In this setting, the technology we use consists of knowledge representation
techniques that offer declarative formalisms for expressing flexible,
semantic views over the data, together with (i) inference algorithms that
provide services for automatic classification and (ii) query rewriting
using views.
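The rewriting step can be sketched as follows: an abstract path taken from the semantic view is expanded, through stored mappings, into the concrete paths to query in each DTD of the cluster. The structures and names below are illustrative assumptions, not the actual Xyleme view format.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // For each concrete DTD, the abstract paths used in the semantic view are
    // mapped to the concrete path(s) realizing them in that DTD.
    struct DtdMapping {
        std::string dtdName;
        std::map<std::string, std::vector<std::string>> abstractToConcrete;
    };

    // Expand one abstract path into (DTD, concrete path) targets to be queried.
    std::vector<std::pair<std::string, std::string>>
    rewrite(const std::string& abstractPath, const std::vector<DtdMapping>& mappings) {
        std::vector<std::pair<std::string, std::string>> targets;
        for (const auto& m : mappings) {
            auto it = m.abstractToConcrete.find(abstractPath);
            if (it == m.abstractToConcrete.end()) continue;   // DTD has no such concept
            for (const auto& concrete : it->second)
                targets.emplace_back(m.dtdName, concrete);
        }
        return targets;
    }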
8 Conclusion
As mentioned earlier, Xyleme is a project distributed over several groups,
mostly in France but in Germany as well. The design of Version 1 of the
system (planned for the end of 2000) is almost complete. The repository,
the crawler and the page ranker are running. Code-wise, the rest varies from
mostly done to still in limbo. We intend to discover new problems along the
way and we believe that the experience will remain as challenging and
exciting as it has been so far.
Notes
1. The number of accessible XML pages is marginal for the moment, less than
1% of the number of HTML pages. However, we believe that many companies have
already switched to XML and that the number of XML pages will explode as soon
as XML is supported by most installed browsers, perhaps within a year.
2. For instance, with 200 million pages, 1.5 billion links were found in
[graphstructure]. This gives some idea of the size of the data structure that
has to be maintained and of the difficulty of the task.