Projection and Transformation in XML Queries

1 Introduction

As more and more data is represented in XML or exchanged in XML, many applications are finding the ability to intelligently query this data essential. Querying XML data requires both the ability to inductively use the structure found in existing documents; however, it also requires the ability to combine data from different sources and restructure it for various purposes. This paper discusses how one query language can meet both needs.

In 1998, two query languages for XML rose to prominence in the web community. These languages were based on different models, and were designed for different kinds of queries. One of these languages, XQL, came from the XML structured document community, and is closely related to the current W3C XPath language, used in both XSL and XPointers. XQL '98 is a tree oriented query language that returns projections of the original document, maintaining the relative sequence and hierarchy among nodes returned by a query; XML '99 extends this language with simple transformations and joins. The other language, XML-QL, originated in the semi-structured database community, and focusesd on the ability to transform and restructure data. XML-QL is similar in many ways to other semi-structured database languages like Lorel and YATL, which are perhaps less known in the web community. For the sake of simplicity, we will use XML-QL as a representative of the semi-structured languages in this paper.

XQL and XML-QL evolved with different usage scenarios in mind, and with different ideas about what is most important. Nevertheless, there is a striking degree of similarity in certain aspects of the languages. Unfortunately, some of the queries that drove the design of each language are impossible to express in the other language, including queries that are helpful for common uses of XML. Some people have suggested that one approach is the proper approach for data, and the other is the proper approach for documents. These people advocate the use of different query languages for each purpose. If the two categories were distinct, this would be a reasonable approach; however, the distinction between traditional data and traditional documents is rapidly fading, in large part due to XML itself. In a very real sense, traditional documents are data found in a format that humans can use, but which is difficult for programs to process in any meaningful way; similarly, traditional data is data found in a format convenient for computers, but less convenient for human beings.When human-readable documents are marked up using XML, the data in those documents is made available for processing by both programs and query languages. This reduces the need to force humans to enter their information in the constrained formats available in traditional databases. In fields as diverse as patient records, financial reports, intelligence, and human resources, the ability to embed traditional data in the narrative contributes greatly to the usefulness of documents. Documents that can be read and used by both humans and programs are extremely useful, especially if XML query languages are able to flexibly perform both traditional document-oriented queries and traditional data-oriented queries on these documents.

Therefore, we feel that an XML query language should learn from both semi-structured query languages and document-oriented query languages. This paper marks a step in that direction. In this paper, we discuss extensions that could be added to XQL in order to create a language that has the expressive power of both XQL and XML-QL. Some of these extensions, including cross-document references and joins, are found in the August 9th Working Draft of XQL (http://metalab.unc.edu/xql/xql-proposal-1999Aug09.html). Others, including support for cross products, are introduced in this paper for the first time. Work on these features is in rapid progress at the time this paper is being written, and we anticipate that the specific syntax presented here may not be identical to the syntax finally adopted for XQL. However, our goal in this paper is not to present a particular syntax, but to show the possibility and desireability of a language that combines most of the power of XQL and XML-QL, and to give examples that illustrate how such a language might be used.

Currently, there is no W3C XML Query Language. Although a W3C XML activity has just started, it may take some time for such a language to be developed. In the meantime, many products are implementing existing XML query languages, and some who have designed such languages are participating in the W3C XML Query Language activity, including designers of the languages discussed in this paper. Two papers written as preparation for this activity were central in framing the problems addressed in this paper. One, "XML Query Languages: Experiences and Exemplars" (http://www-db.research.bell-labs.com/user/simeon/xquery.html), discusses a variety of queries that involve combining and transforming data, showing how various query languages address these needs. The other paper, "The Tree Structure of XML Queries" (http://metalab.unc.edu/xql/tree-structure.html), focuses on the need to inductively preserve structure, a need addressed particularly well by XQL, and addressed less well by most semi-structured query languages. This is functionality that we will be careful to preserve in the proposals discussed here.

We will begin our paper by discussing a few important examples presented in those two papers, discuss the needs illustrated by these examples, and propose extensions to XQL to enable it to combine and transform data more powerfully.

Note: The ideas presented here are inspired and influenced by discussions with various members of the W3C XML Query WG.. The authors of "XML Query Languages: Experiences and Exemplars", Mary Fernandez, Jerome Simeon, and Philip Wadler, have been extremely responsive and helpful in suggesting ways to add functionality which they felt was missing from XQL, and I am grateful for their expertise and encouragement. An earlier version of Node Generators called the for() construct was suggested by Philip Wadler, and the manner in which elements are created is stolen fairly directly from XML-QL. Peter Fankhauser, Karl Aberer, and Ingo Macherius of the GMD-IPSI have also been extremely helpful in helping to bridge the cultural gap between XQL and semi-structured databases, and in offering specific suggestions for language design.

2 Binding and Constructing: XML-QL

Semi-structured query languages are extremely good at combining information from one or more sources and restructuring it to create a new type of result. For instance, suppose we have books in a bibliography, and we wish to list each author along with the titles of books written by that author (This example and the others in this section are adapted from [9].) The original data might look like this:

<?xml version="1.0"?>
<!DOCTYPE bib SYSTEM "books.dtd">
<bib> 
  <book year="1955"> 
	 <title>Harold and the Purple Crayon</title> 
	 <author> 
		<last>Johnson</last> 
		<first>Crockett</first> 
	 </author> 
	 <publisher>Harper and Row</publisher> 
	 <price>$4.76</price> 
  </book> 
  <book year="1956"> 
	 <title>Harold's Fairy Tale</title> 
	 <author> 
		<last>Johnson</last> 
		<first>Crockett</first> 
	 </author> 
	 <publisher>Harper and Row</publisher> 
	 <price>$4.76</price> 
  </book> 
</bib>

Now we would like to transform this data. Let's start with a fairly simple query that lists each author in the bibliography. Each author should occur only once in the output. The results of the query are as follows:

<results>
   <result> 
       <author> 
         <last>Johnson</last> 
         <first>Crockett</first> 
       </author> 
   </result> 
</results>

This query is not possible in XQL, which insists on producing one <author/> element in the output for each <author/> found in the input. In XML-QL, this is a straightforward query:

CONSTRUCT <results> {
  WHERE
    <bib>
      <book>
        <author><firstname>$f</firstname><lastname>$l</lastname></author>
      </book>
    </bib> IN "http://www.bn.com/bib.xml"
  CONSTRUCT
    <result>
      <author><firstname>$f</firstname><lastname>$l</lastname></author>
    </result>
} </results>

In XML-QL, a CONSTRUCT clause is used to build the results of a query. In the beginning of the first CONSTRUCT clause above, a <results/> element is created. In the second CONSTRUCT clause, at the end of the query, a <result/> element and an <author/> element are created. A WHERE clause is used to specify patterns that the query will seek. In the above query, the WHERE clause looks for <bib/> elements which contain <book/> elements which contain <author/> elements, binding the variable $f to the value of each <firstname/> element and $l to the value of each <lastname/> element found within such an author. These variables are then used in the CONSTRUCT clause to create new <author/> elements.

Now let's modify this query to list each author, then list each title of a book written by that author. The output of our query should look like this:

<results>
   <result> 
       <author> 
         <last>Johnson</last> 
         <first>Crockett</first> 
       </author> 
       <title>Harold and the Purple Crayon</title> 
       <title>Harold's Fairy Tale</title> 
   </result> 
</results>

This requires a more complex query:

CONSTRUCT <results> {
  WHERE
    <bib>
      <book>
        <author><last>$l</last><first>$f</first></author>
      </book> 
    </bib> IN "http://www.bn.com/bib.xml"
  CONSTRUCT 
    <result> 
      <author><last>$l</last><first>$f</first></author>
      {
        WHERE 
         <bib>
           <book>
             <title>$t</title>  // join on $l and $f
             <author><last>$l</last><first>$f</first></author>
           </book>
         </bib> IN "http://www.bn.com/bib.xml"
        CONSTRUCT <title>$t</title>
      }
    </result>
} </results>

To understand the above query, it is important to know that the variables in the pattern described by a WHERE clause is matched all at once to create a tuple, and tuples from a WHERE clause are passed to the CONSTRUCT clause as flat, unordered relations. The variables bound by a WHERE clause can then be used to construct documents whose structure may be quite different from that found in the original document. In the above example, for each distinct set of values $l, and $f matched by the first WHERE clause, one <author/> element is created. In the second WHERE clause, these values are used to express a join that selects all books written by an author with the same first and last name. For each such book, the final CONSTRUCT clause inserts the title.

This approach is extremely powerful for transformations, but the flat, unordered relations used to communicate between the WHERE and CONSTRUCT clause make it difficult or impossible to express certain kinds of queries that preserve the hierarchy and sequence of the original data, since the order and hierarchy of the original are not preserved in these relations. These are the kinds of structure-preserving queries for which XQL is particularly well suited, and which are the subject of the next section. On the other hand, XQL is simply unable to perform the queries discussed in this section, and these are clearly useful queries. In the last section, we will explore ways of adding this power to XQL.

3 Projecting from the Document: XQL

When query results are returned, XQL maintains the hierarchy and sequence of information found in a document. This makes some kinds of queries much easier in XQL than in XML-QL, and also makes it possible to perform some queries that can not be done in XML-QL. For an example of a query that is easier in XQL, consider a query that returns just the title and author from each book in the sample data we have been using, placing them in a <book/> element. The result should look like this:

<xql:result>
<bib>
   <book> 
	      <title>Harold and the Purple Crayon</title> 
       <author> 
         <last>Johnson</last> 
         <first>Crockett</first> 
       </author> 
   </book> 
   <book> 
	      <title>Harold's Fairy Tale</title> 
       <author> 
         <last>Johnson</last> 
         <first>Crockett</first> 
       </author> 
   </book> 
</bib>
</xql:result>

The <xql:result> element found in the query results is an artifact of the way that XQL serializes query results to produce text, and would not be present in a query environment that returns nodes or hyperlinks.

Here is the XQL query that was used to produce the above results:

document("http://www.bn.com")/bib {
    book {
        title | author
  }
}

In this query, the document() method takes one parameter, an URI, and returns the document node for the document to which it resolves. From this document, the <bib/> element at the root is found. The {} operator returns an empty element whose type is given by the element to the left of the braces; in the first line of the query, this is the <bib/> element. Within this <bib/> element, all <book/> elements are returned, together with the union of their <title/> and <author/> elements. This is straightforward because this query relies on the structure of the original document, and XQL easily preserves this structure.

The corresponding XML-QL query is much more complicated, using a nested query to preserve the nested structure of the original:

CONSTRUCT <bib> {
  WHERE
    <bib>
      <book>
        <title>$t</title>
      </book> CONTENT_AS $b
    </bib> IN "www.bn.com/bib.xml"
  CONSTRUCT 
    <book> 
      <title>$t</title>
      {  WHERE <author>$a</author> IN $b
         CONSTRUCT <author>$a</>
      }
    </book>
} </bib>

Now let's look at some queries that XML-QL can not do. The first example involves generating a table of contents for a document. Suppose we have data that contains nested sections within chapters. A chapter might look something like this:

<Chapter><Title>Desserts</Title>
    <Para>This chapter tells you all about yummy desserts.</Para>
    <Section><Title>Cookies</Title> 
      <Para>Cookies are flat things that sometimes have chocolate chips.</Para>
    </Section>
    <Section><Title>Candy</Title> 
      <Para>Candy is dandy.</Para>
    </Section>
    <Section><Title>Pies and Tarts</Title> 
      <Para>Pies and tarts are even better.</Para>
    </Section>
    <Section><Title>Cakes, Tortes, and Cupcakes</Title> 
      <Para>Cakes, Tortes, and Cupcakes are absolutely awesome.</Para>
    </Section>
</Chapter>

Obviously, this is a toy document created to hide the complexity and length that would be found in a real document. Suppose we want to generate a table of contents. It may be a table of contents printed on paper, indenting for nested sections and presenting sections in their correct order, or it may involve hyperlinks into a document. Regardless, an important aspect of generating a table of contents is the ability to extract the <Chapter/>, <Section/>, and <Title/> elements, maintaining their relationship. For instance, we may want to generate the following result:

<xql:result>
<Chapter><Title>Desserts</Title>
    <Section><Title>Cookies</Title> 
    </Section>
    <Section><Title>Candy</Title> 
    </Section>
    <Section><Title>Pies and Tarts</Title> 
    </Section>
    <Section><Title>Cakes, Tortes, and Cupcakes</Title> 
    </Section>
</Chapter>
</xql:result>

Here is the XQL query that generates the above results:

//(Chapter | Section | Chapter/Title | Section/Title)

In the above query, "//" indicates that the specified elements are to be found no matter where they occur in the document. The rest of the query simply says to return any <Chapter/> or <Section/> elements, or any <Title/> elements found as immediate children of <Chapter/> or <Section/> elements. Note that the above query says nothing about the sequence in which results should be returned, nor does it say how to nest the results. This information comes from the document that is queried, and its original hierarchy and sequence are preserved, producing the desired results.

Not only can XQL preserve structures found in document trees, it is also very strong in its ability to express conditions based on the tree structure of XML documents. This enables conditions on hierarchy and sequence to be combined in flexible and powerful ways. The semi-structured query languages are based on a general graph representation which does not easily take advantage of the containment and sequence relationships found in document trees. Consider the following data, provided to us by Dr. Tom Lincoln, taken from a patient record encoded in an HL7 format:

<section><section.title>Procedure</section.title> 
  The patient was taken to the operating room where she was placed in supine position and 
  <Anesthesia>induced under general anesthesia.</Anesthesia> 
  <Prep>
  <action>Foley catheter was placed to decompress the bladder</action> 
  and the abdomen was then prepped and draped in sterile fashion. 
  </Prep>
  <paragraph /> 
  <Incision>
  A curvilinear incision was made 
  <Geography>in the midline immediately infraumbilical</Geography> 
  and the subcutaneous tissue was divided 
  <Instrument>using electrocautery.</Instrument> 
  </Incision>
  The fascia was identified and 
  <action>#2 0 Maxon stay sutures were placed on each side of the midline.</action> 
  <Incision>
  The fascia was divided using 
  <Instrument>electrocautery</Instrument> 
  and the peritoneum was entered. 
  </Incision>
  <Observation>The small bowel was identified</Observation> 
  and 
  <action>
  the 
  <Instrument>Hasson trocar</Instrument> 
  was placed under direct visualization. 
  </action>
  <action>
  The 
  <Instrument>trocar</Instrument> 
  was secured to the fascia using the stay sutures 
  </action>
<!-- The section continues, but we excerpt it here... -->

A variety of useful queries involve sequential relationships within some element. For instance, we may be interested in knowing whether any procedures do not record use of anaesthesia before the first incision. Here is an XQL query that seeks such procedures:

 //section[section.title="Procedure"][not(.//Anaesthesia before .//Incision)]

Fortunately, no such procedures are found in the above data. Similarly, we may wish to know what actions were made after the first incision but before the second incision. Here is the XQL query:

 //section[section.title="Procedure"]//((action after Incision[1]) before Incision[2])

Here are the results returned by that query:

<xql:result>
   <action>#2 0 Maxon stay sutures were placed on each side of the midline.</action> 
</xql:result>

4 Transforming XQL

By now, we hope we have established that both semi-structured languages and XQL have important strengths. These languages were designed for different sets of use cases, and with different kinds of data in mind. In today's world, XML and the web have encouraged people to combine data from various data sources with document data in very flexible ways, and the strengths of each set of languages are very relevant. In the rest of this paper, we explore ways to add transformations like those of semi-structured languages to XQL.

Our approach is to adopt mechanisms similar to those found in semi-structured languages for variable binding, joins, sorting, and element creation. Although the August 1999 Working Draft of XQL'99 has constructs for variable bindings and joins, this paper uses mechanisms more similar to those found in semi-structured languages, which we feel are semantically cleaner. In addition, with the adoption of XPath as a W3C Recommendation, we have chosen to base our syntax on XPath, with a few extensions from XQL. The language we call XQL in this paper differs from XPath in the following ways:

XPath Axes are not used in these queries.
The 'before' and 'after' operators used in these examples come from XQL, and are not part of XPath.
Node Generators, Joins, and Sorting, as introduced in this paper, are not found in XPath.

Some distinguishing aspects of semi-structured languages will not be adopted in this paper, as adopting them would lose some of the strengths of XQL. We do not adopt the convention of multiple clauses that communicate via flat, unordered relations, tree structure continues to be basic to the structure of XQL as modified in this paper, and XQL evaluation is not fundamentally changed. Nevertheless, it would probably be fair to describe the resulting language as a semi-structured language. We believe that the results demonstrate the feasibility of creating a language that combines the best qualities of each approach.

4.1 Simplifying Element Creation

In existing versions of XQL, there is a function that creates an element, but element creation is relatively rare and is not supported by all implementations. However, once we add the ability to perform transformations, element creation will become a common operation in XQL queries, so we want to give it a straightforward syntax. Both XML-QL and XSL-T build element structures by inserting XML start and end tags in the clauses that create results. We will adapt their notation for XQL.

An element tag by itself creates an empty element. For instance, the following query creates an empty Bib element:

<Bib>

Some XQL environments need to store queries in settings that do not allow the "<" and ">" characters. In these environments, "{[" and "]}" can be used as synonyms for the angle brackets:

{[Bib]}

Content may be added to a newly created element by placing it within the curly braces that follow it. For instance, the following query creates a Bib element, then adds all Book elements to the newly created Bib element:

<Bib> {
    //Book
}

4.2 Node Generators

Node generators create a node for each value of a subquery. Node generators use the same notation as node creation, but contain a subquery that assigns its value to a variable for each result. For instance, in the following example, one <Book> element is created for each value of the query "//Book", and the author and title are added to this new element:

<Bib> {
    <Book> ($b := //Book) {
        $b/author, $b/title
    }
}

Many of the queries that are possible in XML-QL but not in XQL involve grouping results by value. For instance, the section on XML-QL showed a query in which the name of each author was listed once, followed by the titles of books written by that author. An author's name occurs once on each of these books, but the query lists each author's name only once. To make this possible, node generators may take the "unique" keyword, which removes duplicate values. For instance, the following query lists each author once, no matter how many times the author's name occurs on a book:

<Bib> {
    <Author> (unique $a := //Author) {
        $a
    }
}

The above query requires the query processor to compare values for uniqueness. Structured values are compared by normalizing their whitespace and representing any elements or attributes as markup, then using a string comparison. For instance, consider the following Author element:

<Author>
    <First> Crockett </First> 
    <Last> Johnson </Last> 
</Author>

Suppose the previous query were to encounter this particular author. The variable would be assigned to the contents of the author, and for the sake of comparisons, leading and trailing whitespace would be removed. The resulting string looks like this:

<First>Crockett</First><Last>Johnson</Last>

4.3 Variables and Joins

XQL'99 is capable of joins, but the mechanism for binding variables has been somewhat messy. We can use our new, more elegant manner of binding variables to express joins that integrate information from various sources. For instance, here is a query that lists the title and author of books found on one web site together with reviews from a second and used bookstore listings from a third:

document("http://www.bn.com/bib.xml")/bib/book 
<book> ($i := isbn) {
        title | author
        | document("http://www.amazon.com/reviews.xml")/reviews/entry[isbn=$i]
        | document("http://www.bookfinders.com/listings.xml")[dealers/books/book[isbn=$i]]
  }

Joins can be performed using any set of bound variables. For instance, here is a join taken from the section on XML-QL near the beginning of the paper, listing each author once together with the books written by that author:

<results> {
	document("http://www.bn.com/bib.xml")/bib
 <result> {
	    <author> (unique $a := book/author) {
		      $a,  document("http://www.bn.com/bib.xml")/bib/book[author=$a]/title
		   }
}

4.4 More Sophisticated Construction

Node creation and node generators allow sophisticated construction of results to be performed relatively painlessly. For instance, here is a query that creates an HTML document with a table that contains the names of authors and the titles of books they have written:

<HTML>
  <HEAD> { 
    <TITLE> {
      "Books"
    }
  }
  <BODY> {
    <TABLE> {
       document("http://www.bn.com/bib.xml")/bib/book {
           for (unique $a := author)
              for (unique $t := title) {
               <TR> {
                 <TD> { $a }
                 <TD> { $t }  
              }
           }
       }
    }
  }
}

4.5 Links

XQL '99 can already perform queries across links using the document() method, which we have been using liberally in our examples. Let us modify our bibliography data format to include a hyperlink in the <author/> element. This link points to a document listing the works of this particular author:

<?xml version="1.0"?>
<!DOCTYPE bib SYSTEM "books.dtd">
<bib> 
  <book year="1955"> 
	 <title>Harold and the Purple Crayon</title> 
	 <author HREF="http://www.biblio.com/~jcrockett/books.xml"> 
		<last>Johnson</last> 
		<first>Crockett</first> 
	 </author> 
	 <publisher>Harper and Row</publisher> 
	 <price>$4.76</price> 
  </book> 
</bib>

Now let's use document() to return the complete list of books by the author of "Harold and the Purple Crayon", as listed in the document to which the hyperlink points:

document("http://www.bn.com/bib.xml")
  //author[title="Harold and the Purple Crayon"] {
    document(@HREF)//title
}

4.6 Sorting

A sorting operator allows values in a sequence to be sorted. The following query sorts books by title and the last name of the first author:

document("http://www.bn.com/bib.xml")/bib {
    book sortby title, author[0]/last
}

The following example also lists books, sorted by title, with only the title and the author in the results. Authors are sorted by their last and first names:

document("http://www.bn.com/bib.xml")/bib {
    book { 
        title, 
        author sortby last, first
    } sortby title
}

6 Bibliography

[1] R.Baeza-Yates, G.Navarro Integrating contents and structure in text retrieval. SIGMOD Record, 25(1):67--79, Mar. 1996.
[2] P. Buneman. Tutorial: Semistructured data. In Proceedings of ACM Symposium on Principles of Database Systems, pages 117--121, 1997. http://www.acm.org/pubs/citations/proceedings/pods/263661/p117-buneman/
[3] R. G. Cattell. The Object Database Standard: ODMG 2.0. Morgan Kaufmann, 1997.
[4] P. Cotton, A. Malhotra. Candidate Requirements for XML Query, 1998. In W3C Workshop on Query Languages for XML. http://www.w3.org/TandS/QL/QL98/pp/queryreq.html.
[5] J. Clark, S. DeRose. XML Path Language (XPath) Version 1.0. W3C Proposed Recommendation, 8 October 1999. http://www.w3.org/TR/1999/PR-xpath-19991008.
[6] J. Clark. XSL Transformations (XSLT) Version 1.0 W3C Proposed Recommendation, 8 October 1999. http://www.w3.org/TR/1999/PR-xslt-19991008.
[7] C. L. A. Clarke, G. V. Cormack and F. J. Burkowski An algebra for structured text search and a framework for its implementation. In The Computer Journal, 38(1), 1995.
[8] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In International World Wide Web Conference, 1999. http://www.research.att.com/~mff/files/final.html
[9] M. Fernandez, J. Siméon, P. Wadler, 1999. XML Query Languages: Experiences and Exemplars. http://www-db.research.bell-labs.com/user/simeon/xquery.html.
[10] M. Murata. Data model for document transformation and assembly (extended abstract). In Principles of Digital Document Processing '98, Vol. 1481 of Lecture Notes in Computer Science. Springer-Verlag Inc., 1998.
[11] G.Navarro and R.Baeza-Yates. Proximal nodes: a model to query document databases by contents and structure. ACM TOIS, 1997.
[12] G.H. Gonnet and F.W. Tompa. Mind your grammar: a new approach to modelling text. In VLDB 1987, pp. 339--346, 1987.
[13] D. Maier. Database Desiderata for an XML Query Language. In W3C Workshop on Query Languages for XML. http://www.w3.org/TandS/QL/QL98/pp/maier.html.
[14] J. Robie. The design of XQL, 1999. http://www.texcel.no/whitepapers/xql-design.html.
[15] J. Robie, editor. XQL '99 Proposal, 1999. http://metalab.unc.edu/xql/xql-proposal.html.
[16] J. Robie. The Tree Structure of XML Queries, 1999. http://metalab.unc.edu/xql/xquery-tree.html.
[17] R.Sacks-Davis, T.Arnold-Moore, and J.Zobel. Database systems for structured documents. IEICE Transactions on Information and Systems, E78-D(11):1335--1342, Nov. 1995.