Element Sets: A Minimal Basis for an XML Query Engine

Abstract

This note draws on experience in search and retrieval of tagged text to propose a minimum set of functionality required for an XML query facility to be useful.

The argument is that a facility which offers set arithmetic plus inclusion and containment is sufficient to express a high proportion of queries, and lends itself to an efficient implementation.

Basis of Experience

Between 1986 and 1996 I was employed, first at the New Oxford English Dictionary project, then at Open Text Corporation, in both places almost entirely concerned with the problems of search and retrieval in large structured text-bases. The technology developed for the New OED project at the University of Waterloo and further refined and marketed by Open Text was a search engine named pat. Pat implemented a facility much like that described in this note. It was applied successfully to the 572 million characters of the electronic OED and variety of other text-bases, including service in one of the first large-scale Internet search engines, the now-defunct Open Text Index. Pat (under a variety of different product names) remains in active use by Open Text customers (although the company has retreated from the chronically-unprofitable search and retrieval business.)

The data to which pat was applied was typically tagged in the SGML style, but did not have a DTD - in effect, XML. The scheme described in this note is thus based on 10 years of real-world experience in search and retrieval of XML.

Notation

In this document, I will provide concrete examples of a query language syntax, simply because this makes things easy to explain; I am not proposing an alternative query syntax for XML.

An Element Set

An Element Set is just that; a set of elements. The elements may be empty or have content, which may or may not be mixed.

Any collection of XML documents has a collection of predefined Element Sets, one for each element type occurring in the documents. The advent of the XML namespace technology means that element types can be unambiguously shared (or not) across documents as desired. A predefined set can be referred to simply by namespace name and element type:

Inclusion and Containment

The most basic operations that make Element Sets useful are inclusion and containment. We use the verb including for the first, and within for the second.

The following example finds the set of all paragraphs in the introduction that contain one or more footnotes and also one or more cross references.

set1 = Set('Paragraph') within Set('Introduction')
set2 = set1 including Set('footnote')
set3 = set2 including Set('xref')

Set Arithmetic

The standard set of set-arithmetic operations are provided for Element Sets. We use + to express union, ^ to express intersection, and - to express set difference. No operator is strictly necessary for symmetric difference, although one could be provided.

The following example finds the set of paragraphs in the introduction that contain either a cross reference or a footnote but not both.

set1 = Set('Paragraph') within Set('Introduction')
set2 = set1 including Set('footnote')
set3 = set1 including Set('xref')
set4 = (set2 + set3) - (set2 ^ set3)

Set Generators

Working only with the predefined sets that corresond to the element types is not very interesting. Interesting element sets are those generated by search operations:

set1 = Set('Title', contains="introduction")
set2 = Set('Title', attribute=("xml:lang", "en-US"))

This kind of basic search is primitive compared to what is offered by most commercial information retrieval products. Such products allow the search to rely on morphological processing, word form recognition, thesauri, and a variety of other search heuristics. Unfortunately, most of these techniques suffer from poor portability across languages.

Proximity searching is one technique that can be used across languages, assuming it is character rather than word proximity. Character proximity is required because individual words in running text are not easily identified in many languages including Chinese, Japanese, and German.

In any case, the string-match or search syntax should be extensible to allow the controlled introduction of those advanced search techniques that prove both useful and internationalizable.

What Element Sets Leave Out

Element Sets notably lack any direct access to the tree structure of the XML document such as methods for finding parent, child, and sibling elements. On the surface, this seems like a profoundly important omission, but in practice, it seems to be the case that there are few real-world queries that need to do real tree-walking as opposed to ancestor-descendant processing. In my 10 years' experience in selling the Waterloo/Open Text software, there were very few cases where this absence was a serious obstacle to the usability of the software. (On the other hand, to be fair, our main competitor, then known as Electronic Book Technology, now known as Inso, did offer tree-walking capabilities and claimed there was good application for them).

Conclusion

Element Sets are a remarkably simple technology that which have been proven in industrial practice to be tractable to implement, even for very large text bases, and capable of solving a wide range of real-world problems. I think it is fair to conclude that they represent, more or less, the minimum required for such a facility to be of practical use. Clearly, there are many other advanced features one could wish for in an XML query facility. I suggest that for each such additional piece of functionality, we characterize the class of queries which it enables, above and beyond those enabled by Element Sets, and engage in a cost-benefit analysis.