W3C

The XML Query Algebra

W3C Working Draft 04 December 2000

This version:
http://www.w3.org/TR/2000/WD-query-algebra-20001204
Latest version:
http://www.w3.org/TR/query-algebra
Editors:
Peter Fankhauser (GMD-IPSI) <fankhaus@darmstadt.gmd.de>
Mary Fernández (AT&T Labs - Research) <mff@research.att.com>
Ashok Malhotra (IBM) <petsa@us.ibm.com>
Michael Rys (Microsoft) <mrys@microsoft.com>
Jérôme Siméon (Bell Labs, Lucent Technologies) <simeon@research.bell-labs.com>
Philip Wadler (Avaya Communication) <wadler@avaya.com>

Abstract

This document introduces the XML Query Algebra as a formal basis for an XML query language.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This is a First Public Working Draft for review by W3C Members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". This is work in progress and does not imply endorsement by the W3C membership.

This document has been produced as part of the W3C XML Activity, following the procedures set out for the W3C Process. The document has been written by the XML Query Working Group.

The purpose of this document is to present the current state of the XML Query Algebra and to elicit feedback on its current state. The XML Query Working Group feels it has made good progress on this document but this is still a work in progress and is liable to change in future versions. A number of important issues are still open - see also [B.3.1 Open Issues]. Comments on this document should be sent to the W3C mailing list www-xml-query-comments@w3.org (archived at http://lists.w3.org/Archives/Public/www-xml-query-comments/).

A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.

Table of contents

1 Introduction
2 The Algebra by Example
    2.1 Data and types
    2.2 Projection
    2.3 Atomic data
    2.4 Iteration
    2.5 Selection
    2.6 Quantification
    2.7 Join
    2.8 Restructuring and grouping
    2.9 Querying order
    2.10 Sorting
    2.11 Aggregation
    2.12 Expanded names
    2.13 Functions
    2.14 Structural recursion
    2.15 Functions for all well-formed documents
    2.16 Top-level queries
3 Algebra Components
    3.1 Expanded names
    3.2 Wildcards
    3.3 Atomic types
    3.4 Types : abstract syntax
    3.5 Types : concrete syntax
    3.6 Expressions
4 Equivalences
    4.1 Relating projection to iteration
    4.2 Reducible Expressions
    4.3 Laws
5 Type Rules
    5.1 Relating data to types
    5.2 Equivalences and subtyping
    5.3 Factoring types
    5.4 Environments
    5.5 Expressions
    5.6 Operators and built-in functions
    5.7 Iteration expressions
    5.8 Match expressions
    5.9 Top-level expressions
6 References

Appendices

A The XML Query Data Model
B Issues
    B.1 Introduction
    B.2 Issues list
    B.3 Alphabetic list of issues
        B.3.1 Open Issues
        B.3.2 Resolved (or redundant) Issues

1 Introduction

This document proposes an algebra for XML Query.

This work builds on long standing traditions in the database community. In particular, we have been inspired by systems such as SQL, OQL, and nested relational algebra (NRA). We have also been inspired by systems such as Quilt, UnQL, XDuce, XML-QL, XPath, XQL, and YaTL. We give citations for all these systems below.

In the database world, it is common to translate a query language into an algebra; this happens in SQL, OQL, and NRA, among others. The purpose of the algebra is twofold. First, the algebra is used to give a semantics for the query language, so the operations of the algebra should be well-defined. Second, the algebra is used to support query optimization, so the algebra should possess a rich set of laws. Our algebra is powerful enough to capture the semantics of many XML query languages, and the laws we give include analogues of most of the laws of relational algebra.

It is also common for a query language to exploit schemas or types; this happens in SQL, OQL, and NRA, among others. The purpose of types is twofold. Types can be used to detect certain kinds of errors at compile time and to support query optimization. DTDs and XML Schema can be thought of as providing something like types for XML. The XML Query algebra uses a simple type system that captures the essence of [XSchema1]. The type system is close to that used in XDuce [HP2000]. On this basis, the XML Query algebra is statically typed. This allows to determine and check the output type of a query on documents conforming to an input type at compile time rather than at runtime. Compare this to the situation with an untyped or dynamically typed query language, where each individual output has to be validated against a schema at runtime, and there is no guarantuee that this check will always succeed.

The document is organized as follows. A tutorial introduction is presented in [2 The Algebra by Example]. After reading the tutorial, the reader will have seen the entire algebra and should be able to write example queries. The components of the algebra are described in detail in [3 Algebra Components]. We present some equivalence and optimization laws of the algebra in [4.3 Laws]. Finally, we give the static typing rules for the algebra in [5 Type Rules]. Although it contains the most challenging material, we have tried to make the content as accessible as possible. Nonetheless, this section is not necessary for understanding or using the algebra, therefore the reader may skip this section. In [B.2 Issues list], we discuss open issues and problems.

In Appendix [A The XML Query Data Model], we give a formal mapping relating the algebra to the XML Query Data Model.

Cited literature includes: monads [Mog89], [Mog91], [Wad92], [Wad93], [Wad95], SQL [Date97], OQL [BK93], [BKD90], [CM93], NRA [BNTW95], [Col90], [LW97], [LMW96], Quilt [Quilt], UnQL [BFS00], XDuce [HP2000], XML-QL [XMLQL99], XPath [XPath], XQL [XQL99], and YaTL [YAT99].

2 The Algebra by Example

This section introduces the main features of the algebra, using familiar examples based on accessing a database of books.

2.1 Data and types

Consider the following sample data:

    <bib>
      <book year="1999" isbn="1-55860-622-X">
        <title>Data on the Web</title>
        <author>Abiteboul</author>
        <author>Buneman</author>
        <author>Suciu</author>
      </book>
      <book year="2001" isbn="1-XXXXX-YYY-Z">
        <title>XML Query</title>
        <author>Fernandez</author>
        <author>Suciu</author>
      </book>
    </bib>

Here is a fragment of a XML Schema for such data:

    <xsd:group name="Bib">
      <xsd:element name="bib">
        <xsd:complexType>
          <xsd:group ref="Book"
                     minOccurs="0" maxOccurs="unbounded"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:group>
    
    <xsd:group name="Book">
      <xsd:element name="book">
        <xsd:complexType>
          <xsd:attribute name="year" type="xsd:integer"/>
          <xsd:attribute name="isbn" type="xsd:string"/>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string" maxOccurs="unbounded"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:group>

This data and schema is represented in our algebra as follows:

    type Bib =
      bib [ Book{0, *} ]
    type Book =
      book [
        @year  [ Integer ] & 
        @isbn  [ String ],
        title  [ String ], 
        author [ String ]{1, *}
      ]
    let bib0 : Bib =
      bib [
        book [
          @year  [ 1999 ], 
          @isbn  [ "1-55860-622-X" ], 
          title  [ "Data on the Web" ], 
          author [ "Abiteboul" ], 
          author [ "Buneman" ], 
          author [ "Suciu" ]
        ], 
        book [
          @year  [ 2001 ], 
          @isbn  [ "1-XXXXX-YYY-Z" ], 
          title  [ "XML Query" ], 
          author [ "Fernandez" ], 
          author [ "Suciu" ]
        ]
      ] 

The expression above defines two types, Bib and Book, and defines one global variable, bib0.

The Bib type corresponds to a single bib element, which contains a forest of zero or more Book elements. We use the term forest to refer to an sequence of (zero or more) attributes and elements. Every attribute or element can be viewed as a forest of length one.

The Book type corresponds to a single book element, which contains one year attribute and one isbn attribute, followed by one title element, followed by one or more author elements. The & operator, called an all group, indicates that the year and isbn attributes may occur in any order. A isbn attribute and a title or author element contains a string value, and a year attribute contains an integer.

The variable bib0 is bound to a literal XML value, which is the data model representation of the earlier XML document. The bib element contains two book elements.

The algebra is a strongly typed language, therefore the value of bib0 must be an instance of its declared type, or the expression is ill-typed. Here the value of bib0 is an instance of the Bib type, because it contains one bib element, which contains two book elements, each of which contain an integer-valued year attribute, a string-valued isbn attribute, a string-valued title element, and one or more string-valued author elements.

For convenience, we define a second global variable book0 also bound to a literal value, which is equal to the first book in bib0.

    let book0 : Book = 
        book [
          @year  [ 1999 ], 
          @isbn  [ "1-55860-622-X" ], 
          title  [ "Data on the Web" ], 
          author [ "Abiteboul" ], 
          author [ "Buneman" ], 
          author [ "Suciu" ]
        ]

2.2 Projection

The simplest operation is projection. The algebra uses a notation similar in appearance and meaning to path navigation in XPath. The following expression returns all author elements contained in book elements contained in bib0:

    bib0/book/author
==> author [ "Abiteboul" ], 
    author [ "Buneman" ], 
    author [ "Suciu" ], 
    author [ "Fernandez" ], 
    author [ "Suciu" ]
:   author [ String ] {0, *}
 

Note that in the result, the document order of author elements is preserved and that duplicate elements are also preserved.

The above example and the ones that follow have three parts. First is an expression in the algebra. Second, following the ==> is the value of this expression. Third, following the : is the type of the expression, which is (of course) also a legal type for the value.

It may be unclear why the type of bib0/book/author contains zero or more authors, even though the type of a book element contains one or more authors. Let's look at the derivation of the result type by looking at the type of each sub-expression:

    bib0             : Bib
    bib0/book        : Book{0, *}
    bib0/book/author : author [ String ]{0, *}
 

Recall that Bib, the type of bib0, may contain zero or more Book elements, therefore the expression bib0/book might contain zero book elements, in which case, bib0/book/author would contain no authors.

This illustrates an important feature of the type system: the type of an expression depends only on the type of its sub-expressions. It also illustrates the difference between an expression's run-time value and its compile-time type. Since the type of bib0 is Bib, the best type for bib0/book/author is one listing zero or more authors, even though for the given value of bib0, the expression will always contain exactly five authors.

2.3 Atomic data

One may access atomic data (strings, integers, or booleans) using the keyword data(). For instance, if we wish to select all author names in a book, rather than all author elements, we could write the following.

    book0/author/data()
==> "Abiteboul",
    "Buneman",
    "Suciu"
:   String {1, *}

Similarly, it is possible to project the atomic values of attributes. The following returns the year the book was published.

    book0/@year/data()
==> 1999
:   Integer

This notation is similar to the use of text() in XPath. We chose the keyword data() because, as the second example shows, not all data items are strings.

2.4 Iteration

Another common operation is to iterate over elements in a document so that their content can be transformed into new content. Here is an example of how to process each book to list the author before the title, and remove the year and isbn.

    for b in bib0/book do
      book [ b/author, b/title ]
==> book [
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ], 
      title  [ "Data on the Web" ]
    ], 
    book [
      author [ "Fernandez" ], 
      author [ "Suciu" ], 
      title  [ "XML Query" ]
    ]
:   book [
      author[ String ]{1, *}, 
      title[ String ]
    ]{0, *} 

The for expression iterates over all book elements in bib0, and binds the variable b to each such element. For each element bound to b, the inner expression constructs a new book element containing the book's authors followed by its title. The transformed elements appear in the same order as they occur in bib0.

In the result type, a book element is guaranteed to contain one or more authors followed by one title. Let's look at the derivation of the result type to see why:

    bib0/book        : Book{0, *}
    b                : Book
    b/author         : author [ String ]{1, *}
    b/title          : title  [ String ]

The type system can determine that b is always Book, therefore the type of b/author is author[ String ]{1, *}, and the type of b/title is title[ String ].

In general, the value of a for expression is a forest. If the body of the for expression itself yields a forest, then all of the forests are concatenated together. For instance, the expression:

    for b in bib0/book do
      b/author

is exactly equivalent to the expression bib0/book/author.

Here we have explained the typing of for expressions by example. In fact, the typing rules are rather subtle and will be explained in detail in [5 Type Rules].

2.5 Selection

To select values that satisfy some predicate, we use the where expression. For example, the following expression selects all book elements in bib0 that were published before 2000.

    for b in bib0/book do 
      where b/@year/data() <= 2000 do
        b
==> book [
     @year  [ 1999 ], 
     @isbn  [ "1-55860-622-X" ], 
     title  [ "Data on the Web" ], 
     author [ "Abiteboul" ], 
     author [ "Buneman" ], 
     author [ "Suciu" ]
    ] 
:   Book{0, *} 

In general, an expression of the form:

where e1 do e2

is converted to the form

if e1 then e2 else ()

where e1 and e2 are expressions. Here () is an expression that stands for the empty sequence, a forest that contains no attributes or elements. We also write () for the type of the empty sequence.

According to this rule, the expression above translates to

for b in bib0/book do
  if b/@year/data() <= 2000 then b else () 

and this has the same value and the same type as the preceding expression.

2.6 Quantification

The following expression selects all book elements in bib0 that have some author named "Buneman".

    for b in bib0/book do
      for a in distinct(b/author/data()) do
        where a = "Buneman" do
          b
==> book [
     @year  [ 1999 ], 
     @isbn  [ "1-55860-622-X" ], 
     title  [ "Data on the Web" ], 
     author [ "Abiteboul" ], 
     author [ "Buneman" ], 
     author [ "Suciu" ]
    ] 
:   Book{0, *}

In contrast, we can use the empty operator to find all books that have no author whose name is Buneman:

    for b in bib0/book do
      where empty(for a in b/author do
                    where a/data() = "Buneman" do
                      a) do
        b
==> book [
      @year  [ 2001 ], 
      @isbn  [ "1-XXXXX-YYY-Z" ], 
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

The empty expression checks that its argument is the empty sequence ().

We can also use the empty operator to find all books where all the authors are Buneman, by checking that there are no authors that are not Buneman:

    for b in bib0/book do
      where empty(for a in b/author do
                    where a/data() <> "Buneman" do
                      a) do
        b
==> ()
:   Book{0, *}

There are no such books, so the result is the empty sequence. Appropriate use of empty (possibly combined with not) can express universally or existentially quantified expressions.

Here is a good place to introduce the let expression, which binds a local variable to a value. Introducing local variables may improve readability. For example, the following expression is exactly equivalent to the previous one.

    for b in bib0/book do
      let nonbunemans = (for a in b/author do
                           where a/data() <> "Buneman" do
                             a) do
        where empty(nonbunemans) do
          b

Local variables can also be used to avoid repetition when the same subexpression appears more than once in a query.

2.7 Join

Another common operation is to join values from one or more documents. To illustrate joins, we give a second data source that defines book reviews:

    type Reviews  =
      reviews [
        book [ 
          title  [ String ], 
          review [ String ]
        ]{0, *}
      ]
    let review0 : Reviews =
      reviews [
        book [
          title  [ "XML Query" ], 
          review [ "A darn fine book." ] 
        ], 
        book [
          title  [ "Data on the Web" ], 
          review [ "This is great!" ] 
        ]
      ] 

The Reviews type contains one reviews element, which contains zero or more book elements; each book contains a title and a review.

We can use nested for expressions to join the two sources review0 and bib0 on title values. The result combines the title, authors, and reviews for each book.

    for b in bib0/book do
      for r in review0/book do
        where b/title/data() = r/title/data() do
          book [ b/title, b/author, r/review ]
==> 
    book [
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
      review [ "A darn fine book." ] 
    ], 
    book [
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
      review [ "This is great!" ] 
    ]
:   book [
      title [ String ], 
      author [ String ] {1, *}, 
      review [ String ]
    ] {0, *}

Note that the outer-most for expression determines the order of the result. Readers familiar with optimization of relational join queries know that relational joins commute, i.e., they can be evaluated in any order. This is not true for the XML Query Algebra: changing the order of the first two for expressions would produce different output. In [B.2 Issues list], we discuss extending the algebra to support unordered forests, which would permit commutable joins.

2.8 Restructuring and grouping

Often it is useful to regroup elements in an XML document. For example, each book element in bib0 groups one title with multiple authors. This expression groups each author with the titles of his/her publications.

    for a in distinct(bib0/book/author/data()) do
      biblio [
        author[a], 
        for b in bib0/book do
          for a2 in b/author/data() do
            where a = a2 do
              b/title
      ]
==> biblio [
      author [ "Abiteboul" ], 
      title  [ "Data on the Web" ]
    ], 
    biblio [
      author [ "Buneman" ], 
      title  [ "Data on the Web" ]
    ], 
    biblio [
      author [ "Suciu" ], 
      title  [ "Data on the Web" ], 
      title  [ "XML Query" ]
    ], 
    biblio [
      author [ "Fernandez" ], 
      title  [ "XML Query" ]
    ]
:   biblio [
      author [ String ], 
      title  [ String ] {0, *}
    ] {0, *}

Readers may recognize this expression as a self-join of books on authors. The expression distinct(bib0/book/author) produces a forest of author elements with no duplicates. The outer for expression binds a to each author element, and the inner for expression selects the title of each book that has some author equal to a.

Here distinct is an example of a built-in function. It takes a forest of elements and removes duplicates by content; the order of the resulting forest is not defined.

The type of the result expression may seem surprising: each biblio element may contain zero or more title elements, even though in bib0 every author co-occurs with a title. Recognizing such a constraint is outside the scope of the type system, so the resulting type is not as precise as we would like.

2.9 Querying order

Often it is useful to query the order of elements in a forest. There are two kinds of order among elements. Document or global order refers to the total order among all elements in a document and corresponds to their sequential appearance. Local order refers to the order among sibling elements in a forest. Currently, the algebra supports querying of local order, but not global order. We discuss global order in [B.2 Issues list].

The index function pairs an integer index with each element in a forest:

    index(book0/author)
==> pair [ fst [ 1 ], snd [ author [ "Abiteboul" ] ] ], 
    pair [ fst [ 2 ], snd [ author [ "Buneman" ] ] ], 
    pair [ fst [ 3 ], snd [ author [ "Suciu" ] ] ]
:   pair [ fst [ Integer ], snd [ author [ String ] ] ] {1, *} 

Note that the result type guarantees that at least one pair exists in the result, because (book0/author) always contains one or more authors.

Once we have paired authors with an integer index, we can select the first two authors:

    for p in index(book0/author) do
      where (p/fst/data() <= 2) do
        p/snd/author
==> author [ "Abiteboul" ], 
    author [ "Buneman" ]
:   author [ String ] {0, *}

The for expression iterates over all pair elements produced by the index expression. It selects elements whose index value in the fst element is between one and two inclusive, and it returns the original content in the snd element.

Once again, the result type may be surprising, because the Book type guarantees that each book has at least one author. However, the type system cannot determine that the conditional where clause will always succeed, so the inner clause may produce zero results. (A sophisticated analysis might improve type precision, but is likely to require much work for little benefit.)

2.10 Sorting

To sort a forest, the algebra provides a sort expression, whose form is: sort Var in Exp1 by Exp2. The variable Var ranges over the items in the forest Exp1 and sorts those items using the key value defined by Exp2. For example, this expression sorts book elements in review0 by their titles. Note that the variable b appear in the key expression b/title.

    sort b in review0/book by b/title
==> book [ title  [ "Data on the Web" ], 
           review [ "This is great!" ] ], 
    book [ title  [ "XML Query" ], 
           review [ "This is pretty good too!" ] ]
:   book [ title [ String ], review [ String ] ] ] {0, *}

The sort operator is a restricted form of higher-order function, i.e., it takes a function as an argument. In this case, sort takes a single function, which should extract the key value from each element.

2.11 Aggregation

We have already seen several several built-in functions: index, distinct, and sort. In addition to these functions, the algebra has five built-in aggregation functions: avg, count, max, min, and sum.

This expression selects books that have more than two authors:

    for b in bib0/book do
      where count(b/author) > 2 do
        b
==> book [
      @year  [ 1999 ], 
      @isbn  [ "1-55860-622-X" ], 
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

All the aggregation functions take a forest with repetition type and return an integer value; count returns the number of elements in the forest.

2.12 Expanded names

So far, all our examples of attributes and elements use unqualified local names, i.e., names that do not include an explicit namespace URI. It is also possible to specify and match on the expanded name of an attribute or element. The expanded name {Namespace}LocalName consists of a namespace URI Namespace and a local name LocalName.

Consider an inventory of books that contains data from both http://www.BooksRUs.com and http://www.cheapBooks.com. In this example, the first element contains values whose names are defined in the BooksRUs.com namespace, and the second element contains values whose names are defined in the cheapBooks.com namespace:

    let inventory = 
      inv [ 
        {http://www.BooksRUs.com/books.xsd}book [
          @{http://www.BooksRUs.com/books.xsd}year  [ 1999 ], 
          @{http://www.BooksRUs.com/books.xsd}isbn  [ "1-55860-622-X" ],
          {http://www.BooksRUs.com/books.xsd}title  [ "Data on the Web" ], 
          {http://www.BooksRUs.com/books.xsd}author [ "Abiteboul" ], 
          {http://www.BooksRUs.com/books.xsd}author [ "Buneman" ], 
          {http://www.BooksRUs.com/books.xsd}author [ "Suciu" ]
        ], 
        {http://www.cheapBooks.com/ourschema.xsd}book [
          @{http://www.cheapBooks.com/ourschema.xsd}year  [ 2001 ], 
          {http://www.cheapBooks.com/ourschema.xsd}title  [ "XML Query" ], 
          {http://www.cheapBooks.com/ourschema.xsd}author [ "Fernandez" ], 
          {http://www.cheapBooks.com/ourschema.xsd}author [ "Suciu" ]
          {http://www.cheapBooks.com/ourschema.xsd}isbn   [ "1-XXXXX-YYY-Z" ], 
        ]
       ] 
     ] : Inventory
    type Inventory = inv [ InvBook{0, *}]

In this example, elements imported from existing schemas each refer to a single namespace, thus the definition of InvBook is:

    type BooksRUBook = 
      {http://www.BooksRUs.com/books.xsd}book [
        @{http://www.BooksRUs.com/books.xsd}year  [ Integer ] & 
        @{http://www.BooksRUs.com/books.xsd}isbn  [ String ],
        {http://www.BooksRUs.com/books.xsd}title  [ String ], 
        {http://www.BooksRUs.com/books.xsd}author [ String ]{1, *}
      ]
    type CheapBooksBook = 
      {http://www.cheapBooks.com/ourschema.xsd}book [
        @{http://www.cheapBooks.com/ourschema.xsd}year  [ Integer ],
        {http://www.cheapBooks.com/ourschema.xsd}title  [ String ], 
        {http://www.cheapBooks.com/ourschema.xsd}author [ String ]{1, *},
        {http://www.cheapBooks.com/ourschema.xsd}isbn   [ String ],
      ]    
    type InvBook = BooksRUBook | CheapBooksBook

Here vertical bar (|) is used to indicate a choice between types: each InvBook is either a BooksRUBook or a CheapBooksBook.

We have already seen how to project on the constant name of an attribute or element. It is also useful to project on wildcards, which are used to match names with any namespace and/or any local name. For example, this expression matches elements with any local name and with namespace URI http://www.BooksRUs.com/books.xsd:

    inventory/{http://www.BooksRUs.com/books.xsd}* 
==> {http://www.BooksRUs.com/books.xsd}book [
      {http://www.BooksRUs.com/books.xsd}title  [ "Data on the Web" ], 
      {http://www.BooksRUs.com/books.xsd}author [ "Abiteboul" ], 
      {http://www.BooksRUs.com/books.xsd}author [ "Buneman" ], 
      {http://www.BooksRUs.com/books.xsd}author [ "Suciu" ]
    ]
:   {http://www.BooksRUs.com/books.xsd}book [
      {http://www.BooksRUs.com/books.xsd}title  [ String ], 
      {http://www.BooksRUs.com/books.xsd}author [ String ] {0, *}, 
    ]

Similarly, this expression first projects elements in any namespace whose local name is book and then projects on their year attributes:

    inventory/{*}book/@{*}year
==> @{http://www.BooksRUs.com/books.xsd}year  [ 1999 ], 
    @{http://www.cheapBooks.com/ourschema.xsd}year  [ 2001 ], 
:   (@{http://www.BooksRUs.com/books.xsd}year  [ Integer ] | 
     @{http://www.cheapBooks.com/ourschema.xsd}year [ Integer ]){0,*} 

In general, the expression Exp/a is shorthand for Exp/{*}a, that is, the namespace wildcard can be omitted. Similarly, * is shorthand for {*}*.

2.13 Functions

Functions can make queries more modular and concise. Recall that we used the following query to find all books that do not have "Buneman" as an author.

    for b in bib0/book do
      where empty(for a in b/author do
                    where a/data() = "Buneman" do
                      a) do
        b
==> book [
      @year  [ 2001 ], 
      @isbn  [ "1-XXXXX-YYY-Z" ], 
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

A different way to formulate this query is to first define a function that takes a string s and a book b as arguments, and returns true if book b does not have an author with name s.

     fun notauthor (s : String; b : Book) : Boolean =
       empty(for a in b/author do
               where a/data() = s do
                 a)

The query can then be re-expressed as follows.

    for b in bib0/book do
      where notauthor("Buneman"; b) do
        b
==> book [
      @year  [ 2001 ], 
      @isbn  [ "1-XXXXX-YYY-Z" ], 
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

We use semicolon rather than comma to separate function arguments, since comma is used to concatenate forests.

Note that a function declaration includes the types of all its arguments and the type of its result. This is necessary for the type system to guarantee that applications of functions are type correct.

In general, any number of functions may be declared at the top-level. The order of function declarations does not matter, and each function may refer to any other function. Among other things, this allows functions to be recursive (or mutually recursive), which supports structural recursion, the subject of the next section.

Functions make the algebra extensible. We have seen examples of built-in functions (sort and distinct) and examples of user-defined functions (notauthor). In addition to built-in and user-defined functions, the algebra could support externally defined functions, i.e., functions that are not defined in the algebra itself, but in some external language. This would make special-purpose implementations of, for example, full-text search functions available in the algebra. We discuss support for externally defined functions in [B.2 Issues list].

2.14 Structural recursion

XML documents can be recursive in structure, for example, it is possible to define a part element that directly or indirectly contains other part elements. In the algebra, we use recursive types to define documents with a recursive structure, and we use recursive functions to process such documents. (We can also use mutually recursive functions for more complex recursive structures.)

For instance, here is a recursive type defining a part hierarchy.

    type Part = Basic | Composite
    type Basic = 
      basic [
        cost [ Integer ]
      ]
    type Composite = 
      composite [
        assembly_cost [ Integer ], 
        subparts [ Part{1, *} ]
      ]

And here is some sample data.

    let part0 : Part =
      composite [
        assembly_cost [ 12 ], 
        subparts [
          composite [
            assembly_cost [ 22 ], 
            subparts [
              basic [ cost [ 33 ] ]
            ]
          ], 
          basic [ cost [ 7 ] ]
        ]
      ]

Here vertical bar (|) is used to indicate a choice between types: each part is either basic (no subparts), and has a cost, or is composite, and includes an assembly cost and subparts.

We might want to translate to a second form, where every part has a total cost and a list of subparts (for a basic part, the list of subparts is empty).

    type Part2 =
      part [
        total_cost [ Integer ], 
        subparts [ Part2{0, *} ]
      ]

Here is a recursive function that performs the desired transformation. It uses a new construct, the match expression.

    fun convert(p : Part) : Part2 =
      match p 
        case b : Basic do
          part [
            total_cost [ b/cost/data() ], 
            subparts []
          ]
        case c : Composite do 
          let s = (for y in children(c/subparts) do 
                     convert(y)) do
            part [
              total_cost [
                q/assembly_cost/data() + 
                  sum(s/total_cost/data())
              ], 
              subparts[ s ]
            ]
        else error

Each branch of the match expression is labeled with a type, Basic or Composite, and with a corresponding variable, b or c. The evaluator checks the type of the value of p at run-time, and evaluates the corresponding branch. If the first branch is taken then b is bound to the value of p, and the branch returns a new part with total cost the same as the cost of b, and with no subparts. If the second branch is taken, then c is bound to the value of p. The function is recursively applied to each of the subparts of c, giving a list of new subparts s. The branch returns a new part with total cost computed by adding the assembly cost of c to the sum of the total cost of each subpart in s, and with subparts s.

One might wonder why b and c are required, since they have the same value as p. The reason why is that p, b, and c have different types.

    p : Part
    b : Basic
    c : Composite

The types of b and c are more precise than the type of p, because which branch is taken depends upon the type of value in p.

Applying the query to the given data gives the following result.

    convert(part0)
==> part [
      total_cost [ 74 ], 
      subparts [
        part [
          total_cost [ 55 ], 
          subparts [
            part [
              total_cost [ 33 ], 
              subparts []
            ]
          ]
        ], 
        part [
          total_cost [ 7 ], 
          subparts []
        ]
      ]
    ]

Of course, a match expression may be used in any query, not just in a recursive one.

2.15 Functions for all well-formed documents

Recursive types allow us to define a type that matches any well-formed XML document. This type is called AnyTree :

    type AnyTree        = AnyScalar | AnyElement | AnyAttribute
    type AnySimpleType  = AnyScalar | AnyScalar{0,*}
    type AnyAttribute   = @*[ AnySimpleType ]
    type AnyElement     = *[ AnyComplexType ]
    type AnyComplexType = AnyTree{0, *}
    type AnyType        = AnySimpleType | AnyComplexType

AnyTree stands for any scalar, element, or attribute value. Here AnyScalar is a built-in scalar type. It stands for the most general scalar type, and all other scalar types (like Integer or String) are subtypes of it. AnySimpleType stands for the most general simple type, which is a scalar or a list of scalars. AnyAttribute stands for the most general attribute type. The asterisk (*) is used to indicate a wild-card type, i.e., a type whose name is not known statically. The type AnyAttribute may have any name, and its content must have type AnySimpleType, i.e., it may contain atomic values, but no elements. The AnyElement stands for the most general element type, which may have any name, and its content must be a complex type, which is a repetition of zero or more AnyTrees. Finally, an AnyTree is either an AnySimpleType or an AnyComplexType. In other words, any element, attribute, or atomic value has type AnyType.

The use of * is a significant extension to XML Schema, because XML Schema has no type corresponding to *[t], where t is some type other than AnyComplexType. It is not clear that this extension is necessary, since the more restrictive expressiveness of XML Schema wildcards may be adequate.

In particular, our earlier data also has type AnyTree.

    book0 : AnyTree
==> book [
      @year  [ 1999 ], 
      @isbn  [ "1-55860-622-X" ], 
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
    ]
:   AnyTree

A specific type can be indicated for any expression in the query language, by writing a colon and the type after the expression.

As an example of a function that can be applied to all well-formed documents, we define a recursive function that converts any XML data into HTML. We first give a simplified definition of HTML.

    type HTML_body =
      ( AnyScalar
      | b [ HTML_body ]
      | ul [ (li [ HTML_body ]){0, *} ]
      ) {0, *}

An HTML body consists of a sequence of zero or more items, each of which is either a scalar, or a b element, where the content is an HTML body, or an ul element, where the children are li elements, each of which has as content an HTML body.

Now, here is the function that performs the conversion.

    fun html_of_xml( x : AnyTree ) : Html_Body =
      match x 
        case s : AnyScalar do s
        case v1 : AnyAttribute do 
          b [ name(v1) ], 
          ul [ for y in children(v1) do li [ html_of_xml(y) ] ],
        case v2 : AnyElement do 
          b [ name(v2) ], 
          ul [ for y in attributes(v2) do li [ html_of_xml(y) ], 
          ul [ for y in children(v2) do li [ html_of_xml(y) ] ]
        else error

The first branch of the match match expression checks whether the value of x is a subtype of AnyScalar, and if so then s is bound to that value, so if this branch is taken then s is the same as x, but with a more precise type (it must be a scalar, not an element). This branch returns the scalar.

The second branch checks whether the value of x is a subtype of AnyAttribute. As before, v1 is the same as x but with a more precise type (it must be an attribute, not a scalar). This branch returns a b element containing the name of the attribute, and a ul element containing one li element for each value of the attribute. The function is recursively applied to get the content of the li element. The last branch is analogous to the second, but it matches an element instead of an attribute, and it applies html_of_xml to each of the element's attributes and children.

Applying the query to the book element above gives the following result.

    html_of_xml(book0)
==> b [ "book" ], 
    ul [
      li [ b [ "year" ],   ul [ li [ 1999 ] ] ], 
      li [ b [ "isbn" ],   ul [ li [ "1-55860-622-X" ] ] ],  
      li [ b [ "title" ],  ul [ li [ "Data on the Web" ] ] ], 
      li [ b [ "author" ], ul [ li [ "Abiteboul" ] ] ], 
      li [ b [ "author" ], ul [ li [ "Buneman" ] ] ], 
      li [ b [ "author" ], ul [ li [ "Suciu" ] ] ]
    ]
:   HTML_Body

2.16 Top-level queries

A query consists of a sequence of top-level expressions, or query items, where each query item is either a type declaration, a function declaration, a global variable declaration, or a query expression. The order of query items is immaterial; all type, function, and global variable declarations may be mutually recursive.

Each query expression is evaluated in the environment specified by all of the declarations. (Typically, all of the declarations will precede all of the query expressions, but this is not required.) We have already seen examples of type, function, and global variable declarations. An example of a query expression is:

    query html_of_xml(book0)

To transform any expression into a top-level query, we simply precede the expression by the query keyword.

3 Algebra Components

In this section, we summarize the algebra and present the grammars for types and expressions. We start by defining the non-terminal and terminal symbols that appear in the type and expression grammars.

3.1 Expanded names

A Namespace is a URI, and a LocalName is an NCName, as in the Namespace recommendation [XML Names]. We let Namespace range over namespaces and the null value, which represents the absence of a namespace, and LocalName range over NCNames. The symbol Name ranges over expanded names. An expanded name is either a local name for which the namespace is absent or namespace, local name pairs.

namespace Namespace ::= URI
| Null
local name LocalName ::= NCName
expanded name Name ::= LocalName shorthand for {Null}LocalName
| {Namespace}LocalName

Ed. Note: MF (Oct-13-2000): We need to add the data model accessors that extract the namespace and the local name from an expanded name.

3.2 Wildcards

A wildcard denotes a set of names. A wildcard item is of the form {*}* denoting any name in any namespace, {Namespace}* denoting any name in namespace Namespace, {*}LocalName denoting the local name LocalName in any namespace, or {Namespace}LocalName denoting just the name with namespace Namespace and local name LocalName. A wildcard consists of wildcard items, union of wildcards, or difference of wildcards. We let WItem range over wildcard items, and Wild range over wildcard expressions.

wildcard item WItem ::= {*}* any namespace, any local name
| {Namespace}* namespace Namespace, any local name
| {*}LocalName any namespace, local name LocalName
| Name a constant namespace and local name
| * shorthand for {*}*
wildcard Wild ::= WItem all names in Wild
| Wild1|Wild2 any name in Wild1 or Wild2
| Wild1!Wild2 all names in Wild1 and not in Wild2

For example, the wildcard {*}*!{http://www.foo.org/baz}*!{http://www.xsd.com/xsi}String denotes any name in any namespace except for names in namespace http://www.foo.org/baz and for local name String in namespace http://www.xsd.com/xsi.

Wildcards denote sets of expanded names, so we can define a containment relation, <:wild that relate sets of wildcards. The inequalities that hold for this relation include:

{Namespace}LocalName <:wild {*}LocalName
{Namespace}LocalName <:wild {Namespace}*
{Namespace}* <:wild {*}*
{*}LocalName <:wild {*}*
{Namespace}LocalName <:wild {*}*

The last inequality holds by transitivity. Note, however, that {*}LocalName <:wild {Namespace}* does not necessarily hold.

3.3 Atomic types

The Algebra takes as given the atomic datatypes from XML Schema Part 2 [XSchema2]. We let p range over all atomic datatypes.

atomic datatype p ::= Integer | Float | String
| Boolean | URI | NCName, etc.

Typically, an atomic datatype is either a primitive datatype, or is derived from another atomic datatype by specifying a set of facets. A type hierarchy is induced between scalar types by containment of facets. Note that lists of atomic datatypes are specified using repetition and unions are specified using alternation, as defined in [3.4 Types : abstract syntax]

The built-in atomic data type AnyScalar stands for the most general scalar type, and all other scalar types (like Integer or String) are subtypes of it.

3.4 Types : abstract syntax

[Figure 1]  contains the abstract syntax for the Algebra's type system. A type is closely related to a model group as defined in [MSL]. MSL uses standard regular expression notation for model groups and we do the same for algebra types. This type system captures the essence of [XSchema1] and is similar to the type system used in XDuce [HP2000].


type variable y    
type t ::= y type variable
    | () empty sequence
    | Ø empty choice
    | Wild wildcard type
    | u unit type
    | t1 , t2 sequence, t1 followed by t2
    | t1 | t2 choice, t1 or t2
    | t1 & t2 all, t1 and t2 in either order
    | t {m, n} repetition of t between m and n times
bound m, n ::= natural number or *
unit type u ::= p atomic datatype
    | @Wild[t] attribute with name in Wild and content in t
    | Wild[t] element with name in Wild and content in t
prime type q ::= u
    | q | q


Figure 1: Abstract Syntax for Types

Ed. Note: MF (Nov-16-2000): Explain why prime type is here.

Ed. Note: MF (Nov-16-2000): Explain why Wild occurs in types.

The empty sequence matches only the empty document; it is an identity for sequence and all. The empty choice matches no document; it is an identity for choice.

() , t = t = t , ()
() & t = t = t & ()
Ø | t = t = t | Ø

The bounds on a repetition type will be either a natural number (that is, either a positive integer or zero) or the special value *, meaning unbounded. We extend arithmetic to include * in the obvious way:

m + * = * + m = *
0 · * = * · 0 = 0
m · * = * · m = * if m ¹ 0
m min * = * min m = m
m max * = * max m = *
m <= * = true
* < m = false

For technical reasons, we allow the lower bound of a repetition to be *. A repetition t {m, n} is equivalent to the empty choice Ø if m > n or if m is *.

The algebra's external type system, that is, the type definitions associated with input and output documents, is XML Schema. The internal types are in some ways more expressive than XML Schema, for example, XML Schema has no type corresponding to {*}*[t] where t is some type other than AnyComplexType. In general, mapping XML Schema types into internal types will not lose information, however, mapping internal types into XML Schema may lose information.

3.5 Types : concrete syntax

It is useful to define a concrete syntax for the Algebra's types using syntactic categories that specify various subsets of types. Syntactic classes can indicate, for example, that the content of an attribute should be a simple type and that the content of an element should consist of attributes followed by elements. These restrictions guarantee that errors in type construction can be caught during parsing of a query.

In the concrete syntax for types, we capitalize non-terminal symbols. We first distinguish between type variables used to name attribute groups, element groups, and the content types of elements.

TypeVar ::= AttrGroupVar
| ElemGroupVar
| ContentTypeVar

A simple type is either an atomic type, a list of atomic types, or a choice of atomic types, which allows a choice of atomic lists.

SimpleType ::= p atomic
| p{m,n} list of atomic
| SimpleType | SimpleType choice of atomic

An attribute group defines the syntactic form of attributes: their content may only be a simple type and they are combined only in all groups, not sequences.

AttrGroup ::= @Wild[SimpleType]
| @Wild[SimpleType]{0,1} optional attribute
| AttrGroup & AttrGroup
| AttrGroupVar
| ()

An element group contains elements with constant or wildcard names and they are combined in sequences, choices, all groups, and repetitions.

ElemGroup ::= Wild[ContentType] element
| ElemGroup , ElemGroup
| ElemGroup | ElemGroup
| ElemGroup & ElemGroup
| ElemGroup{m,n}
| ElemGroupVar
| ()
| Ø

The content type of an element is either an attribute group, an element group, a sequence of an attribute group followed by an element group, or a content-type variable.

ContentType ::= AttrGroup
| ElemGroup
| AttrGroup , ElemGroup
| ContentTypeVar content-type variable

Finally, a type in an algebraic expression may be an attribute group, an element group, or a content type:

Type ::= AttrGroup | ElemGroup | ContentType

Ed. Note: MF (Nov-16-2000): This needs work.

3.6 Expressions

[Figure 2]  contains the concrete syntax for expressions. A few of these expressions can be rewritten as other expressions in a smaller core algebra; such reducible expressions are labeled with "*". We define the algebra's typing rules on the smaller core algebra. In [4.3 Laws], we give the laws that relate a reducible expression with its equivalent expression in the core algebra. Typing rules for the core algebra are defined in [5 Type Rules].

We have seen examples of most of the expressions, so we will only point out details here. We define a subset of expressions that correspond to data values. An expression is a data value if it consists only of scalar constant, element, sequence, and empty sequence expressions.


name Name
function FuncName
variable Var
constant Const atomic-datatype constant
binary operator BinaryOp ::= + | - | and | or
    | = | != | < | <= | >= | >
unary operator UnaryOp ::= + | - | not
name expression NameExp ::= Name
| {Exp}Exp computed name
expression Exp ::= Const atomic constant
    | NameExp name expression
    | Var variable
    | @Name[Exp] attribute
    | Name[Exp] element
    | @NameExp[Exp] computed attribute
    | NameExp[Exp] computed element
    | Exp, Exp sequence
    | () empty sequence
    | if Exp then Exp else Exp conditional
    | let Var = Exp do Exp local binding
    | FuncName (Exp ;...; Exp) function application
    | Exp : Type explicit type
    | error error
    | Exp BinaryOp Exp binary operator
    | UnaryOp Exp unary operator
    | attributes (Exp) attributes
    | children (Exp) children
    | name (Exp) element name
    | for Var in Exp do Exp iteration
    | match Exp match
      case Var : Type do Exp
      ···
      case Var : Type do Exp
      else Exp
    | Exp / @Wild attribute projection *
    | Exp / Wild element projection *
    | Exp / data() atomic projection *
    | where Exp do Exp where conditional *
    | empty (Exp) emptiness predicate *
    | cast Exp : Type run-time type cast *
    | let Var : Type = Exp do Exp local variable with type *
query item Query ::= type TypeVar = Type type declaration
    | fun FuncName (Var : Type ; ... ; Var : Type): Type =
        Exp function declaration
    | let Var : Type = Exp global variable declaration
    | query Exp query expression


Figure 2: Algebra Expression Syntax

Ed. Note: MF (Oct-13-2000): We need to add the data model accessors that extract the namespace and the local name from an expanded name.

We have not defined the semantics of the binary operators in the algebra. It might be useful to define more than one type of equality over scalar and element values. We leave that to future work.

[Figure 3]  summarizes the built-in functions of the algebra.


avg, count, min, max, sum Aggregation functions
index Pairs each element of a forest with integer index.
sort Sorts elements on key value bound by variable argument.
distinct Removes duplicates from a forest.
namespace Extracts URI namespace from a name value.
localname Extracts NCName local name from a name value.


Figure 2: Built-In Functions

Ed. Note: MF (Nov-16-2000): We need a section here on operational semantics of expressions. The intent of the "Algebra by Example" section is to do this by example, but a formal operational semantics for each operator is necessary. See [Issue-0074:  Operational semantics for expressions].

4 Equivalences

4.1 Relating projection to iteration

The examples use the / operator liberally, but in fact we use / as a convenient abbreviation for expressions built from lower-level operators: for expressions, the children function, and match expressions.

For example, the expression:

    book0/author

is equivalent to the expression:

    for v in children(book0) do
      match v 
        case a : author[AnyComplexType] do a
        else ()

Here the children function returns a forest consisting of the children of the element book0, namely, a title element and three author elements (the order is preserved). The for expression binds the variable v successively to each of these elements. Then the match expression selects a branch based on the type of v. If it is an author element then the first branch is evaluated, otherwise the second branch. If the first branch is evaluated, it returns a. The variable a contains the same value as v, but the type of a is author[String], which is the intersection of the type of v and the type author[AnyComplexType]. If the second branch is evaluated, then the branch returns (), the empty sequence.

To compose several expressions using /, we again use for expressions. For example, the expression:

    bib0/book/author

is equivalent to the expression:

    for b in children(bib0) do
      match b
        case b : book[AnyComplexType] do
          for d in children(b) do
            match d
              case a : author[AnyComplexType] do a
              else ()
        else ()

The for expression iterates over all book elements in bib0 and binds the variable b to each such element. For each element bound to b, the inner expression returns all the author elements in b, and the resulting forests are concatenated together in order.

In general, an expression of the form e / a is converted to the form

    for v1 in e do
      for v2 in children(v1) do
        match v2
          case v3 : a[AnyComplexType] do v3
          else ()

where e is an expression, a is an element name, and v1, v2, and v3 are fresh variables (ones that do not appear in the expression being converted).

According to this rule, the expression bib0/book translates to

    for v1 in bib0 do
      for v2 in children(v1) do
        match v2
          case v3 : book[AnyComplexType] do v3 
          else ()

In [4.3 Laws], we discuss laws which allow us to simplify this to the previous expression:

    for v2 in children(bib0) do
      match v2
        case v3 : book[AnyType] do v3 
        else ()

Similarly, the expression bib0/book/author translates to

    for v4 in (for v2 in children(bib0) do
                 match v2
                   case v3 : book[AnyType] do v3 
                   else ()) do
      for v5 in children(v4) do
        match v5
          case v6 : author[AnyType] do v6
          else ()

Again, the laws will allow us to simplify this to the previous expression

    for v2 in children(bib0) do
      match v2
        case v3 : book[AnyType] do
          for v5 in children(v3) do
            match v5
              case v6 : author[AnyType] do v6
              else ()
        else ()

These examples illustrate an important feature of the algebra: high-level operators may be defined in terms of low-level operators, and the low-level operators may be subject to algebraic laws that can be used to further simplify the expression.

4.2 Reducible Expressions

[Figure 4]  contains the laws that relate the reducible expressions (i.e., those labeled with "*" in [Figure 2]) to equivalent expressions. In these definitions, Exp1{Exp2/ Var} denotes the expression Exp1 in which all occurrences of the variable Var are replaced by the expression Exp2.

In Rule 1, the projection expression e / Wild is rewritten as a for expression, which binds v to each element in the forest e, and returns the children elements of v whose name is in Wild. Similarly, Rule 2 rewrites the attribute projection expression e / @Wild as a match embedded in a for expression.

Rule 3 rewrites the e / data() expression by iterating over items in e. The nested match expression returns e's children if e is AnySimpleType. Otherwise, it returns the empty sequence.

Rule 4 simply rewrites a where expression by an if - then - else expression.

Rule  5 rewrites empty(e) as a match expression that returns true if the type of e is the empty sequence.

Rule  6 rewrites cast e : t as a match expression that returns the value of e with type t, if it has that type at runtime, otherwise it raises an error.

Rule 7 rewrites the let expression with a type as a let expression without a type by moving the type constraint into the expression.


e / Wild Þ for v1 in e do
  for v2 in children (v1) do
      match v2 case v3 : Wild[AnyType] do v3 else () (1)
 
e / @Wild Þ for v1 in e do
  for v2 in children (v1) do
      match v2 case v3 : @Wild[AnyType] do v3 else () (2)
 
e / data() Þ for v in e do
match children(v)
    case v : AnySimpleType do v
    else ()
(3)
 
where e1 do e2 Þ if e1 then e2 else () (4)
 
empty(e) Þ match e case v : () do true else false (5)
 
cast e : t Þ match e case v : t do v else error (6)
 
let v : Type = e1 do e2 Þ let v = (e1: Type) do e2 (7)


Figure 4: Definitions

4.3 Laws

In this section, we describe some laws that hold for the algebra. These laws are important for defining rules that simplify algebraic expressions, such as eliminating unnecessary for or match expressions.

The iteration construct of the algebra is closely related to an important mathematical object called a monad. A monad, among other things, generalizes set, bag, and list types. In functional languages, the comprehension construct is used to express iteration over set, bag, and list types. A comprehension corresponds directly to a monad [Wad92], [Wad93], [Wad95].

The correspondence between the algebra's iteration construct and a monad is close, but not exact. Each monad is based on a unary type constructor, such as Set{t} or List{t}, representing a homogenous set or list where all elements are of type t. In the algebra, we have more complex and heterogenous types, such as a forest consisting of a title, a year, and a sequence of one or more authors. Also, one important component of a monad is the unit operator, which converts an element to a set or list. If x has type t, then set{x} is a unit set of type Set{t} or [x] is a unit list of type List{t}. In the algebra, we simply write, say, author["Buneman"], which stands for both a tree and for the unit forest containing that tree.

Monads satisfy three laws, and three corresponding laws are satisfied by the the Algebra's for expression.

First, iteration over a unit forest can be replaced by substition. This is called the left unit law.

for v in e1 do e2 = e2{v := e1}

provided that e1 is a unit type (e.g., is an element or a scalar constant). We write e2{v := e1} to denote the result of taking expression e2 and replacing occurrences of the variable v by the expression e1. For example,

for v in author["Buneman"] do auth[v/data()] = auth["Buneman"]

The iteration over a forest of one item can always be eliminated using variable substitution.

Second, an iteration that returns the iteration variable is equivalent to the identity. This is called the right unit law.

for v in e do v = e

For example

for v in book0 do v = book0

An important feature of the type system described here is that the left side of the above equation always has the same type as the right side.

Third, there are two ways of writing an iteration over an iteration, both of which are equivalent. This is called the associative law.

for v2 in (for v1 in e1 do e2) do e3
= for v1 in e1 do (for v2 in e2 do e3)

For example, a projection over a forest includes an implicit iteration, so e / a = for v in e do v / a. Say we define a forest of bibliographies, bib1 = bib0, bib0. Then bib1/book/author is equivalent to the first expression below, which in turn is equivalent to the second.

for b in (for a in bib1 do a/book) do b/author
= for a in bib1 do (for b in a/book do b/author)

With nested relational algebra, the monad laws play a key role in optimizing queries. Similarly, the monad laws can also be exploited for optimization in this context.

For example, if b is a book, the following finds all authors of the book that are not Buneman:

  for a in b do
    where a/data() != Buneman do
      a

If l is a list of authors, the following renames all author elements to auth elements:

  for a' in l do
    auth[ a'/data() ]

Combining these, we select all authors that are not Buneman, and rename the elements:

  for a' in (for a in b do
    	       where a/data() != Buneman do
    		 a) do
    auth[ a'/data() ]

Applying the associative law for a monad, we get:

  for a in b do
    for a' in (where a/data() != Buneman do a) do
      auth[ a'/data() ]

Expanding the where clause to a conditional, we get:

  for a in b do
    for a' in (if a/data() != Buneman then a else ()) do
      auth[ a'/data() ]

Applying a standard law for distributing loops over conditionals gives:

  for a in b do
    if a/data() != Buneman then
      for a' in a do
        auth[ a'/data() ]
    else ()

Applying the left unit law for a monad, we get:

  for a in b do
    if a/data() != Buneman then
      auth[ a/data() ]
    else ()

And replacing the conditional by a where clause, we get:

  for a in b do
    where a/data() != Buneman do
      auth[ a/data() ]

Thus, simple manipulations, including the monad laws, fuse the two loops.

[4.1 Relating projection to iteration] ended with two examples of simplification. Returning to these, we can now see that the simplifications are achieved by application of the left unit and associative monad laws.

[Figure 5]  contains a dozen algebraic simplification laws. In a relational query engine, algebraic simplifications are often applied by a query optimizer before a physical execution plan is generated; algebraic simplification can often reduce the size of the intermediate results computed by a query evaluator. The purpose of our laws is similar -- they eliminate unnecessary for or match expressions, or they enable other optimizations by reordering or distributing computations. The set of laws given is suggestive, rather than complete.


E ::= if [] then e1 else e2
  | let v = [] in e
  | for v in [] do e
  | match []
  case v : t do e
  ...
  case v : t do e
  | else e
 
for v in () do e Þ () (8)
 
for v in (e1, e2) do e3 Þ (for v in e1 do e3), (for v in e2 do e3) (9)
 
for v in e1 do e2 Þ e2{e1/ v},  if e1 : u (10)
 
 
for v in e do v Þ e (11)
 
E[ if e1 then e2 else e3] Þ if e1 then E[e2] else E[e3] (12)
 
E[ let v = e1 do e2] Þ let v = e1 do e[e2] (13)
 
E[ for v in e1 do e2] Þ for v in e1 do E[e2] (14)
 
E[ match e1
case v : t do e2
casev : t do en-1
...
else en ]
Þ
match e1
case v : t do E[ e2 ]
...
case v : t do E[ en-1 ]
else E[en ]
(15)
 


Figure 5: Optimization Laws

Rules 8, 9, and 10 simplify iterations. Rule 8 rewrites an iteration over the empty sequence as the empty sequence. Rule 9 distributes iteration through sequence: iterating over the sequence (e1, e2) is equivalent to the sequence of two iterations, one over e1 and one over e2. Rule 10 eliminates an iteration over a single element or scalar. If e1 is a unit type, then e1 can be substituted for occurrences of v in e2.

Ed. Note: MF (Oct-18-2000) The rules for eliminating trivial match expressions need to be written. They are more complex than those for the old case expressions.

Rule 11 eliminates an iteration when the result expression is simply the iteration variable v.

Rules 12--15 are used to commute expressions. Each rule actually abbreviates a number of other rules, since the context variable e stands for a number of different expressions. The notation e [e] stands for one of the six expressions given with expression e replacing the hole [] that appears in each of the alternatives. For instance, one of the expansions of Rule 14 is the following, when e is taken to be for v in [] do e :

for v2 in (for v1 in e1 do e2) do e3 Þ for v1 in e1 do (for v2 in e2 do e3)

5 Type Rules

We explain our type system in the form commonly used in the programming languages community. The rules for typing expressions are described with an inference rule notation. Originally developed by logicians, this notation is now widely used for describing type systems and semantics of programming languages. For a textbook introduction to type systems, see, for example, [Mitchell].

In inference notation, when all judgements above the line hold, then the judgement below the line holds as well. Here is an example of a rule used later on:

|- Data1: t1      |- Data2: t2

|- (Data1, Data2): (t1, t2)

Data is the subset of expressions that consists only of scalar content, attribute, element, sequence, and empty sequence expressions.

Data ::= Const atomic constant
    | @Name[Data] attribute
    | Name[Data] element
    | Data, Data sequence
    | () empty sequence

The judgement "|- Data : t" is read as "in the empty environment, the value Data has type t." The symbol |- is called a turnstyle, and is usually preceded by an environment symbol, which represents a mapping from variables to values. In this example, there is no environment symbol, which means the judgement holds in the empty environment. In [5.5 Expressions], we will see examples of rules that have non-empty environments. The rule states that if both Data1 : t1 and Data2 : t2 hold, then (Data1, Data2): (t1, t2) holds as well. For instance, take Data1 = a[1], Data2 = b["two"], t1 = a[Integer], and t2 = b[String].

Then since both a[1] : a[Integer] and b["two"] : b[String] hold, we may conclude that (a[1], b["two"]) : (a[Integer], b[String]) holds.

5.1 Relating data to types

The following type rules relate Data values to types. The next rule states that any constant whose lexical form defines a value of type p, then the constant has type p:


|- Constp: p

The next rule states that any constant also has type AnyScalar:


|- Const : AnyScalar

The next two rules are for attribute and element construction. The first rule states that if Data has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.

|- Data : t

|- @Name[Data]: @Name[t]

|- Data : t

|- Name[Data]: Name[t]

The next rule is for typing sequences and was already described at the beginning of [5 Type Rules].

|- Data1: t1      |- Data2: t2

|- (Data1, Data2): (t1, t2)

The next rule states that the empty sequence value always has the empty sequence type:


|- (): ()

The rules above associate the most specific type possible with a Data value. The remaining rules in this section associate more general types with a Data value, which are necessary when the type system must determine whether a Data value with most specific type t is permissible when a value of type t' is expected. This occurs during type checking of a query.

The next two rules are also for attribute and element construction, but these rules specify more general types. The first rule states that if Data has type t, and Name is in the set of names defined by the wildcard expression Wild, then the given attribute expression also has type @Wild[t]. The subsequent rule is analogous for element expressions.

|- Data : t      Name <:Wild Wild

|- @Name[Data] : @Wild[t]

|- Data : t

|- Name[Data] : Wild[t]

The next rule states that if the value Data has type t1, then it also has type t1| t2. The subsequent rule is analogous and states that the choice operator is associative.

|- Data : t1

|- Data : (t1| t2)

|- Data : t2

|- Data : (t1| t2)

The next rule states that the sequence of one value with type t followed by a value with repetition type t{m,n} has repetition type t{m+1,n +1}.

|- Data1: t      |- Data2: t {m, n}

|- (Data1, Data2): t {m +1, n +1}

The next rule states that the sequence of one value with type t followed by a value with repetition type t{0,n} has repetition type t{0,n +1}.

|- Data1: t      |- Data2: t {0, n}

|- (Data1, Data2): t {0, n +1}

The next rule states that the empty sequence is a repetition of any type with lower bound 0:

 

|- (): t {0, n}

5.2 Equivalences and subtyping

The symbol <: denotes the subtype relation. We write t1<: t2 if for every data Data such that |- Data : t1, it is also the case that |- Data : t2, i.e., is t1 is a subtype of t2. The subtyping relation is used in many of the type rules that follow. It is easy to see that the subtype relation <: is a partial order, i.e., it is reflexive, t <: t, and it is transitive, if t1<: t2and t2<: t3then t1<: t3. Here are some of the inequalities that hold:

Ø <: t
t <: AnyType
p <: AnySimpleType
t1 <: t1 | t2
t2 <: t1 | t2

Further, if Name <:wild Wild and t <: t', then

Name[t] <: Wild[t']

If t <: t' and m ³ m' and n £ n' then

t {m, n} <: t {m', n'}

If t1<: t1' and t2<: t2' then

t1, t2 <: t1', t2'

We write t1= t2 if t1<:t2 and t2<: t1. Here are some of the equalities that hold:

AnySimpleType = Integer | String | Boolean, etc.
(t1, t2), t3 = t1, (t2, t3)
t, () = t
(), t = t
t1| t2 = t2| t1
(t1| t2) | t3 = t1 | (t2| t3)
t | Ø = t
Ø | t = t
t1, (t2| t3) = (t1, t2) | (t1, t3)
(t1| t2), t3 = (t1, t3) | (t2, t3)
(t1 & t2) = (t1, t2) | (t2, t1)
t, Ø = Ø
Ø, t = Ø
Wild[t1| t2] = Wild[t1] | Wild[t2]
Wild[t] | {*}*[t] = {*}*[t]
t {m +1, n +1} = t, (t {m, n})
t {0, n +1} = () | t, (t {0, n})
t {0, 0} = ()
t {m, n} = Ø, if m > n or m = *

We also have that t1<: t2 if and only if t1| t2= t2.

We define the intersection t1 /\ t2 of two types t1 and t2 to be the largest type t that is smaller than both t1 and t2. That is, t = t1 /\ t2 if t <: t1 and t <: t2 and if for any t' such that t' <: t1 and t' <: t2, we have t' <: t.

5.3 Factoring types

Before giving the remaining type rules, we introduce factoring: Many of the operations in the algebra have an operand that should be of a repetition type. This is true for the built-in operators index, sort and distinct.

For example, consider the following.

    index (children (book0))
==> pair [ fst [ 1 ], snd [ title [ "Data on the Web" ] ], 
    pair [ fst [ 2 ], snd [ author [ "Abiteboul" ] ] ], 
    pair [ fst [ 3 ], snd [ author [ "Buneman" ] ] ], 
    pair [ fst [ 4 ], snd [ author [ "Suciu" ] ] ], 
    pair [ fst [ 5 ], snd [ year [ 1999 ] ]

How should we describe the type of the result? By nesting the children of book0 under snd the original sequence of title, author+, year gets lost. snd can contain either a title, an author, or a year. More formally, we need to find q, m, and n such that

 children(book0) : q{m, n} 

and then the type is given by

 index(children(book0)) : pair [ fst [ String ], snd [q] ]{m, n} 

In the case of books, the values of q are:

 title [ String ] | author [ String ] | year [ Integer] 

the value of m is 3 (because there will be one title, at least one author, and one year element) and the value of n is * (because there may be any number of author elements).

We call a type like q a prime type. In general, it may contain scalar, attribute, element, choice, and empty choice types, but it will not contain repetition, sequence, or empty sequence types (except, perhaps, within an element or attribute type). The definition of prime types appears in [Figure 1].


factor (p) = p {1, 1}
factor (Wild[t]) = Wild[t]{1, 1}
factor (@Wild[t]) = @Wild[t]{1, 1}
factor (t1 , t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti)
factor (t1 & t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti)
factor (t1| t2) = (q1| q2){m1min m2, n1max n2} where qi{mi, ni} = factor (ti)
factor (t {m, n}) = q {m' · m, n' · n} where q {m', n'} = factor (t)
factor (()) = Ø{0, 0}
factor (Ø) = Ø{*, 0}
 


Figure 6: Definition of factoring

factor, as shown in [Figure 6], converts any type t to a type of the form q {m, n}, where t <: q {m, n}, so that any value that has type t also has type q {m, n}. For example,

factor ( title[String], author[String]{1, *}, year[Integer] )
  = (title[String] | author[String] | year[Integer]){3, *}

We can see here that the factored type is less specific than the unfactored type. For mnemonic convenience we write q {m, n} = factor (t), but one should actually think of the function as returning a triple consisting of a prime type q and two bounds m and n.

Just as factoring a number yields a product of prime numbers, factoring a type yields a repetition of prime types. Further, the result yielded by factoring is in some sense optimal. If q {m, n} = factor (t) then t <: q {m, n} and furthermore for any q', m', and n' such that t <: q'{m', n'} we have that q <: q' and m >= m' and n <= n'. Also, if t = t', then factor (t) = factor (t'). In particular, the choice of the lower bound * for factor (Ø) guarantees that factor (t) = factor (t | Ø), since m min * = m.

Note that factor is only used by the type inference rules, and thus is not part of the algebra expressions.

5.4 Environments

The type rules make use of an environment that specifies the types of variables and functions. The type environment is denoted by G, and is composed of a comma-separated list of variable types, Var : t or function types, FuncName :(t1 ;  ... ;  tn) : t. We retrieve type information from the environment by writing (Var : t) Î G to look up a variable, or by writing (FuncName : (t1;...; tn) : t) Î G to look up a function.

The type checking starts with an environment that contains all the types declared for functions and global variables. For instance, before typing the first query of [2.2 Projection], the environment contains:

G = bib0 : Bib, book0 : Book 

While doing the type-checking, new variables will be added in the environment. For instance, when typing the query of [2.4 Iteration], variable b will be typed with Book, and added in the environment. This will result in a new environment:

G' = G, b : Book

5.5 Expressions

We write G |- Exp : t if in environment G the expression Exp has type t. Below are all the rules except those for for and match expressions, which are discussed in later subsections.

The next rule states that in any environment G, a constant whose lexical form defines a value of type p has type p. So, for example, G |- 1 : Integer and G |- "two" : String.


G |- Constp: p

The next five rules are for typing expressions that define names.

 

|- LocalName: {Null}LocalName

 

|- {Namespace}LocalName: {Namespace}LocalName

G |- Exp : NCName

G |- {Namespace}Exp: {Namespace}*

G |- Exp : URI

G |- {Exp}LocalName: {*}LocalName

G |- Exp1: URI      G |- Exp2: NCName

G |- {Exp1}Exp2: {*}*

The next rule is a trivial definition of the variable Var having type t in environment G:

(Var : t) Î G

G |- Var : t

The next two rules are for attribute and element construction with a constant tag name. The first rule states that if Exp has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.

G |- Exp : t

G |- @Name[Exp]: @Name[t]

G |- Exp : t

G |- Name[Exp]: Name[t]

The next two rules are for attribute and element construction in which the tag name is computed. The first rule states that if Exp1 is in the set of names defined by Wild, and Exp2 has type t, then the given computed attribute expression has type @Wild[t]. The subsequent rule is analogous for element expressions.

G |- Exp1: Wild      G |- Exp2: t

G |- *@Exp1[ Exp2]: @Wild[t]

G |- Exp1: Wild      G |- Exp2: t

G |- *Exp1[ Exp2]: Wild[t]

The following two rules are analogous to the sequence and empty sequence rules in [5.1 Relating data to types].

G |- Exp1: t1      G |- Exp2: t2

G |- Exp1, Exp2: t1, t2


G |- () : ()

Note that in the type rule for a conditional expression, the result type is the choice (t2|t3).

G |- Exp1: Boolean      G |- Exp2: t2      G |- Exp3: t3

G |- if Exp1 then Exp2 else Exp3: (t2 | t3)

A let expression extends the current environment G with the variable Var with type t. Note that Exp2, the body of the let expression, is typed in the extended environment, and the type of the entire let expression is t2.

G |- Exp1: t1      G, Var : t1 |- Exp2: t2

G |- let Var = Exp1 do Exp2: t2

The next rule is for function application. In a function application, the type of each actual argument to the function must be a subtype of the corresponding formal argument to the function, i.e., it is not necessary for the actual and formal types to be equal.

(FuncName : (t1;...; tn) : t) Î G
G |- Exp1: t'1      t'1 <: t1
···
G |- Expn: t'n      t'n <: tn

G |- FuncName (Exp1;...; Expn) : t

The next rule states that is always permissible to explicitly type an expression with a type t' that is a supertype of the expression's type t. In programming-language terminology, this operation is sometimes called an ``upcast''.

G |- Exp : t'      t' <: t

G |- (Exp : t) : t

The error expression always has the empty choice type.


G |- error : Ø

The attributes, children, and name operators only apply to values with element type. The first two rules require that the content type t is equal to an attribute group followed by an element group. This constraint permits extraction of the appropriate result type.

G |- Exp : Wild[t]      t = AttrGroup, ElemGroup

G |- attributes(Exp) : AttrGroup

G |- Exp : Wild[t]      t = AttrGroup, ElemGroup

G |- children(Exp) : ElemGroup

Ed. Note: MF, Oct 23/2000: I'm not sure if using the abstract syntax for types here is fair game. An alternative is defining two type functions attrgroup and elemgroup, which extract the constituent types, but that seemed overkill given that we already specify the constraint on content type in the abstract syntax.

The next rule extracts the type of the element's name from the element's type Wild[t].

G |- Exp : Wild[t]

G |- name(Exp) : Wild

5.6 Operators and built-in functions

The complete set of operators are not yet defined, so it is not possible to give the typing rules for operators. In general, however, arithmetic operators will have a type rule such as the following, in which t1 and t2 are numeric types and appropriate conversion exist between the two:

G |- Exp1 : t1      G |- Exp2 : t2      t1 <: Integer | Float | Decimal, etc.     t2 <: Integer | Float | Decimal, etc.

G |- Exp1 Oparith Exp2 : t

Equality and inequality operators are typed similarly.

G |- Exp1 : t1      G |- Exp2 : t2

G |- Exp1 Opeq Exp2 : Boolean

The type rule for index requires that its argument be a factored type. The second expression above the judgement line converts t into a factored type.

G |- Exp : t      q {m, n} = factor (t)

G |- index Exp : pair [ fst [ Integer ], snd [q]]{m, n}

Similarly, the type rule for sort also requires a factored type. Note that the key expression Exp2 is typed in the extended environment G |- Var : q.

G |- Exp1 : t1      q{m, n} = factor(t1)      G, Var : q |- Exp2 : t2

G |- sort Var in Exp1 by Exp2: q{m, n}

Ed. Note: MF, Oct 23/2000: This definition assumes that the equality operator on t is defined. An alternative is requiring Exp2 to have AnyScalar, but that seems too restrictive.

The types of aggregated expressions must be factored, and their prime type must be a numeric type.

G |- Exp : t      q{m, n} = factor (t)      q <: Integer | Float | Decimal, etc.

G |- agg Exp : Integer
agg is one of avg, max, min, sum.

G |- Exp : t

G |- count Exp : Integer

The distinct operator also requires an expression with a factored type, and because distinct removes duplicates, the minimum bound of the result type is the minimum of 1 and the minimum bound of the factored type.

G |- Exp : t      q {m, n} = factor (t)

G |- distinct Exp : q {m min 1, n}

The last two rules are for name expressions: they extract the constituent parts of a name.

G |- Exp : Wild

G |- namespace(Exp) : URI

G |- Exp : Wild

G |- localname(Exp) : NCName

5.7 Iteration expressions

The typing of for expressions is rather subtle. We give an intuitive explanation first and then the detailed typing rules below.

A unit type is defined in [Figure 1]; it is either an atomic type, or an attribute or element type with a constant or wildcard name. A for expression

for Var in Exp1 do Exp2

is typed as follows. First, one finds the type of expression Exp1. Next, for each unit type in this type one assumes the variable Var has the unit type and one types the body Exp2. Note that this means we may type the body of Exp2 several times, once for each unit type in the type of Exp1. Finally, the types of the body Exp2 are combined, according to how the types were combined in Exp1. That is, if the type of Exp1 is formed with sequencing, then sequencing is used to combine the types of Exp2, and similarly for choice or repetition.

For example, consider the following expression, which selects all author elements from a book.

    for c in children(book0) do
      match c
        case a : author[AnyType] do a
        else ()

The type of children(book0) is

    title[String], year[Integer], author[String]{1,*}

This is composed of three unit types, and so the body is typed three times.

assuming c has type title[String] the body has type ()
" year[Integer] " ()
" author[String] " author[String]

The three result types are then combined in the same way the original unit types were, using sequencing and iteration.

    (), (), author[String]{1,*}

as the type of the iteration, and simplifying yields

    author[String]{1,*}

as the final type.

As a second example, consider the following expression, which selects all title and author elements from a book, and renames them.

    for c in children(book0) do
      match c
        case t : title[String]  do titl [ t/data() ]
        case y : year[Integer]  do ()
        case a : author[String] do auth [ a/data() ]
        else error()

Again, the type of children(book0) is

    title[String], year[Integer], author[String]{1,*}

This is composed of three unit types, and so the body is typed three times.

assuming c has type title[String] the body has type titl[String]
" year[Integer] " ()
" author[String] " auth[String]

The three result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields

    titl[String], (), auth[String]{1,*}

as the type of the iteration, and simplifying yields

    titl[String], auth[String]{1,*}

as the final type. Note that the title occurs just once and the author occurs one or more times, as one would expect.

As a third example, consider the following expression, which selects all basic parts from a sequence of parts.

    for p in children(part0/subparts) do
      match p
        case b : Basic     do b
        case c : Composite do ()
        else error()

The type of children(part0/subparts) is

    (Basic | Composite){1,*}

This is composed of two unit types, and so the body is typed two times.

assuming p has type Basic the body has type Basic
" Composite " ()

The two result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields

    (Basic | ()){1,*}

as the type of the iteration, and simplifying yields

    Basic{0,*}

as the final type. Note that although the original type involves repetition one or more times, the final result is a repetition zero or more times. This is what one would expect, since if all the parts are composite the final result will be an empty sequence.

In this way, we see that for expressions can be combined with match expressions to select and rename elements from a sequence, and that the result is given a sensible type.

In order for this approach to typing to be sensible, it is necessary that the unit types can be uniquely identified. However, the type system given here satisfies the following law.

 a [t1 | t2] = a [t1] | a [t2] 

This has one unit type on the left, but two distinct unit types on the right, and so might cause trouble. Fortunately, our type system inherits an additional restriction from XML Schema: we insist that the regular expressions can be recognized by a top-down deterministic automaton. In that case, the regular expression must have the form on the left, the form on the right is outlawed because it requires a non-deterministic recognizer. With this additional restriction, there is no problem.

The method of translating projection to iteration described in the previous section combined with the typing rules given here yield optimal types for projections, in the following sense. Say that variable x has type t, and the projection x / a has type t'. The type assignment is sound if for every value of type t, the value of x / a has type t'. The type assignment is complete if for every value y of type t' there is a value x of type t such that x / a = y. In symbols, we can see that these conditions are complementary.

sound: forall x Î t. exist y Î t'. x / a = y
complete: forall y Î t'. exist x Î t. x / a = y

Any sensible type system must be sound, but it is rare for a type system to be complete. But, remarkably, the type assignment given by the above approach is both sound and complete.

The type rule for for expressions uses the following auxiliary judgement. We write G |- for Var : t do Exp : t', if in environment G when the bound variable Var of an iteration has type t then the body Exp of the iteration has type t'.

G, Var : t |- Exp : t'

G |- for Var : t do Exp : t


G |- for Var : () do Exp : ()

G |- for Var : t1 Exp : t'1      G |- for Var : t2 Exp : t'2

G |- for Var : t1, t2 do Exp : t'1, t'2


G |- for Var : Ø do Exp : Ø

G |- for Var : t1 Exp : t'1      G |- for Var : t2 Exp : t'2

G |- for Var : t1 | t2 do Exp : t'1 | t'2

G |- for Var : t Exp : t'

G |- for Var : t* do Exp : t'*

Given the above rules, the type rule for for expressions is immediate.

G |- Exp1 : t1      G |- for Var : t1 do Exp2 : t2

G |- for Var in Exp1 do Exp2 : t2

5.8 Match expressions

The typing of match expressions is closely related to the typing of for expressions. Due to the typing rules of for expressions, it is possible that the body of an iteration is checked many times. Thus, when a match expression is checked, it is possible that quite a lot is known about the type of the expression being matched, and one can determine that only some of the clauses of the match apply. The definition of match uses the auxiliary judgments to check whether a given clause is applicable.

We write G |- case Var : t do Exp : t', if in environment G, the bound variable Var of the case has type t, then the body Exp of case has type t'. Note the type of the body is irrelevant if t = Ø.

¬(t = Ø)     G, Var : t |- Exp : t'

G |- case Var : t do Exp : t'


G |- case Var : Ø do Exp : Ø

We write G |- t <: t' else Exp: t'' if in environment G when t <: t' does not hold, then the body Exp of the match expression's else clause has type t''. Note that the type of the body is irrelevant if t < t'.

t <: t'

G |- t <: t' else Exp : Ø

¬ t <: t'      G |- Exp : t''

G |- t <: t' else Exp : t''

Given the above, it is straightforward to construct the typing rule for a match expression. Recall that we write t /\ t' for the intersection of two types.

G |- Exp0 : t0
G |- case Var1 : t0 /\ t1 do Exp1 : t'1
···
G |- case Varn : t0 /\ tn do Expn : t'n
G |- t0 < t1 | ... | tn else Expn+1 : t'n+1

G |-  
(match Exp0
    case Var1 : t1 do Exp1
    ···
    case Varn : tn do Expn
    else Expn+1)
 : t'1 | ... | t'n+1

5.9 Top-level expressions

We write G |- Query if in environment G, the query item Query is well-typed. A type declaration is always well typed:

 

G |- type x = t

A function declaration is well-typed if in the environment extended with the type assignments for its formal variables, its body is well-typed.

G, Var1 : t1, ..., Varn : tn |- Exp : t'      t' <: t

G |- fun FuncName (Var1 : t1; ...; Varn : tn) : t = Exp

A top-level let expression is well-typed if the type of its body, t', is a subtype of the bound variable's type.

G |- Exp : t'      t' <: t

G |- let Var : t = Exp

Finally, a top-level query expression is well-typed if its body is well-typed.

G |- Exp : t

G |- query Exp

We extract the relevant component of a type environment from a query item Query with the function environment (Query).

environment ( type x = t) = ()
environment ( fun FuncName (Var1: t1; ...; Varn: tn) : t) = FuncName (Var1: t1; ...; Varn: tn) : t
environment ( let Var : t = Exp) = Var : t

We write |- Query1 ... Queryn if the sequence of query items Query1 ... Queryn is well typed.

G |- environment (Query1),..., environment (Queryn)
G |- Query1      ...      G |- Queryn

Query1... Queryn

6 References

AHV95
S. Abiteboul, R. Hull, V. Vianu. Foundations of Databases. Addison Wesley, 1995.
Bird98
Richard Bird. Introduction to Functional Programming using Haskell. Prentice Hall, 1998.
BFS00
P. Buneman, M. Fernandez, D. Suciu. UnQL: A query language and algebra for semistructured data based on structural recursion. VLDB Journal, to appear.
BK93
Catriel Beeri and Yoram Kornatzky. Algebraic Optimization of Object-Oriented Query Languages. Theoretical Computer Science 116(1&2):59--94, August 1993.
BKD90
Francois Bancilhon, Paris Kanellakis, Claude Delobel. Building an Object-Oriented Database System. Morgan Kaufmann, 1990.
BLSTW94
Peter Buneman, Leonid Libkin, Dan Suciu, Van Tannen, and Limsoon Wong. Comprehension Syntax. SIGMOD Record, 23:87--96, 1994.
BMR99
David Beech, Ashok Malhotra, Michael Rys. A Formal Data Model and Algebra for XML. W3C XML Query working group note, September 1999.
BNTW95
Peter Buneman, Shamim Naqvi, Val Tannen, Limsoon Wong. Principles of programming with complex object and collection types. Theoretical Computer Science 149(1):3--48, 1995.
BT99
Catriel Beeri and Yariv Tzaban, SAL: An Algebra for Semistructured Data and XML, International Workshop on the Web and Databases (WebDB'99), Philadelphia, Pennsylvania, June 1999.
Cat97
R. G. Cattell. The Object Database Standard: ODMG 2.0. Morgan Kaufmann, 1997.
Quilt
Don Chamberlin, Jonathan Robie, and Daniela Florescu. Quilt: An XML Query Language for Heterogeneous Data Sources. International Workshop on the Web and Databases (WebDB'2000), Dallas, Texas, May 2000.
CCS00
Vassilis Christophides and Sophie Cluet and Jérôme Siméon. On Wrapping Query Languages and Efficient XML Integration. Proceedings of ACM SIGMOD Conference on Management of Data, Dallas, Texas, May 2000.
CM93
S. Cluet and G. Moerkotte. Nested queries in object bases. Workshop on Database Programming Languages, pages 226--242, New York, August 1993.
YAT99
S. Cluet, S. Jacqmin and J. Siméon The New \YaTL: Design and Specifications. Technical Report, INRIA, 1999.
Col90
L. S. Colby. A recursive algebra for nested relations. Information Systems 15(5):567--582, 1990.
Date97
Hugh Darwen (Contributor) and Chris J. Date. Guide to the SQL Standard : A User's Guide to the Standard Database Language SQL Addison-Wesley, 1997.
XMLQL99
A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In International World Wide Web Conference, 1999. See http://www.research.att.com/~mff/files/final.html
GTW78
J. A. Goguen, J. W. Thatcher, E. G. Wagner. An initial algebra approach to the specification, correctness, and implementation of abstract data types. In Current Trends in Programming Methodology, pages 80--149, Prentice Hall, 1978.
KKS92
M. Kifer, W. Kim, and Y. Sagiv. Querying object-oriented databases. Proceedings of ACM SIGMOD Conference on Management of Data, pages 393--402, San Diego, California, June 1992.
LW97
Leonid Libkin and Limsoon Wong. Query languages for bags and aggregate functions. Journal of Computer and Systems Sciences, 55(2):241--272, October 1997.
HP2000
Haruio Hosoya, Benjamin Pierce, XDuce : A Typed XML Processing Language (Preliminary Report) WebDB Workshop 2000.
LMW96
Leonid Libkin, Rona Machlin, and Limsoon Wong. A query language for multi-dimensional arrays: Design, implementation, and optimization techniques. SIGMOD 1996.
Mitchell
John C. Mitchell Foundations for Programming Languages. MIT Press, 1998. *** Write this ****
Mog89
E. Moggi, Computational lambda-calculus and monads. In Symposium on Logic in Computer Science Asilomar, California, IEEE, June 1989.
Mog91
E. Moggi, Notions of computation and monads. Information and Computation, 93(1), 1991.
MSL
A. Brown, M. Fuchs, J. Robie, P. Wadler, editors, MSL : A Model for W3C XML Schema. October, 2000.
OCaml
The Caml Language. See http://pauillac.inria.fr/caml/.
XQL99
J. Robie, editor. XQL '99 Proposal, 1999. See http://www.ibiblio.org/xql/xql-proposal.html.
SS86
H.-J. Schek and M. H. Scholl. The relational model with relational-valued attributes. Information Systems 11(2):137--147, 1986.
TF86
S. J. Thomas and P. C. Fischer. Nested Relational Structures. In Advances in Computing Research: The Theory of Databases, JAI Press, London, 1986.
Wad92
Philip Wadler. Comprehending monads. Mathematical Structures in Computer Science, 2:461-493, 1992.
Wad93
P. Wadler, Monads for functional programming. In M. Broy, editor, Program Design Calculi, NATO ASI Series, Springer Verlag, 1993. Also in J. Jeuring and E. Meijer, editors, Advanced Functional Programming, LNCS 925, Springer Verlag, 1995.
Wad95
P. Wadler, How to declare an imperative. ACM Computing Surveys, 29(3):240--263, September 1997.
Wad99
Philip Wadler. A formal semantics of patterns in XSLT. Markup Technologies, Philadelphia, December 1999.
Won00
Limsoon Wong. An introduction to the Kleisli query system and a commentary on the influence of functional programming on its implementation. Journal of Functional Programming, to appear.
XML Names
World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names .
XSLT99
World-Wide Web Consortium XSL Transformations (XSLT), Version 1.0. W3C Recommendation, November 1999. See http://www.w3.org/TR/xslt.
XQDM00
World-Wide Web Consortium XML Query Data Model, Working Draft, May 2000. See http://www.w3.org/TR/query-datamodel.
XPath
World-Wide Web Consortium XML Path Language (XPath) : Version 1.0. November, 1999. See http://www.w3.org/TR/xpath.html.
XSchema1
World-Wide Web Consortium XML Schema Part 1 : Structures, Working Draft. April 2000. See http://www.w3.org/TR/xmlschema-1
XSchema2
World-Wide Web Consortium XML Schema Part 2 : Datatypes, Working Draft, April 2000. See http://www.w3.org/TR/xmlschema-2.

A The XML Query Data Model

The XML Query Algebra uses the XML Query Data Model [XQDM00]. Because the XML examples in this document do not use many of the features of the data model, such as attributes, declared namespaces, comments, the algebra uses an abbreviated notation. Here, we give the mapping from the abbreviated algebraic expressions to complete expressions in the XML Query Data Model:

        s         = ref (stringValue (s, ref Def_string, ref []));
        i         = ref (integerValue(i, ref Def_integer));
        a[ kids ] = ref (elemNode
                           (qnameValue (null, a, ref Def_string), {}, {}, kids, (ref Def_t))
                        );

For example, the string s constructs a string value and the integer i constructs an integer value. The term a[ kids ] constructs an ElemNode with the tag a, an empty set of namespaces, an empty set of attributes, the forest of children nodes kids, and the XML Schema type t, where t is the type of the expression a[ kids ].

Again, to simplify presentation, types do not include attributes, but they can be added easily. Subtyping or containment relationships exist between types. We discuss subtyping in [5 Type Rules].

Ed. Note: MF (Aug-11-2000): This section needs to be completed.

B Issues

B.1 Introduction

The issues in [B.2 Issues list] serve as a design history for this document. The ordering of issues is irrelevant. Each issue has a unique id of the form Issue-<dddd> (where d is a digit). This can be used for referring to the issue by <url-of-this-document>#Issue-<dddd>. Furthermore, each issue has a mnemonic header, a date, an optional description, and an optional resolution. For convenience, resolved issues are displayed in green. Some of the issues contain references to W3C internal archives. These are marked with "W3C-members only". Some of the descriptions of the resolved issues are obsolete w.r.t. to the current version of the document.

Ed. Note: PF (Aug-05-2000): For the sake of archival, there are some duplicate issues raised in multiple instances. Duplicate issues are marked as "resolved" with reference to the representative issue.

B.2 Issues list

Unless stated explicitly otherwise, [Issue-0001:  Attributes] through [Issue-0039:  Dereferencing semantics]  have been raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0142.html (W3C-members only).

Issue-0001:  Attributes

Date: Jul-26-2000

Description: One example of the need for support of [Issue-0049:  Unordered Collections], but also: Attributes need to be constrained to contain white space separated lists of simple types only.

Resolution: Attributes are represented by @attribute-name[content]. See [3.4 Types : abstract syntax], [3.5 Types : concrete syntax] for the constraint on white space seperated lists of simple types, and [3.6 Expressions] for selecting and constructing attributes.

Issue-0002:  Namespaces

Date: Jul-26-2000

Resolution: Namespaces are represented by {uri-of-namespace}localname. See [3.1 Expanded names] and [3.2 Wildcards].

Issue-0003:  Global Order

Date: Jul-26-2000

Description: The data model and algebra do not define a global order on documents. Querying global order is often required in document-oriented queries.

See the thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0179.html (W3C-members only).

Issue-0004:  References vs containment

Date: Jul-26-2000

Description: The query-algebra datamodel currently does not explicitly model children-elements by references (other than the XML-Query Datamodel. This facilitates presentation, but may be an oversimplification with regard to [Issue-0005:  Element identity].

Resolution: This issue is resolved by subsumption as follows: (1) As [A The XML Query Data Model] points out, all child-elements are (implicit) references to nodes. (2) Thus, having resolved [Issue-0005:  Element identity] this issue is resolved too.

Issue-0005:  Element identity

Date: Jul-26-2000

Description: Do expressions preserve element identity or don't they? And does "=" and distinct use comparison by reference or comparison by value?

Resolution: The first part of the question has been resolved by resolution of [Issue-0010:  Construct values by copy]. The second part raises a more specific issue [Issue-0066:  Shallow or Deep Equality?].

Issue-0006:  Source and join syntax instead of "for"

Date: Jul-26-2000

Description: Another term for "source and join syntax" is "comprehension". See [4.3 Laws] for a discussion of the relationship between iteration by for and comprehension syntax.

Resolution: This issue is resolved by subsumption under [Issue-0021:  Syntax]. List comprehension is a syntactic alternative to "for v in e1 do e2", which has been favored by the WG in the resolution of [Issue-0021:  Syntax].

Issue-0007:  References: IDREFS, Keyrefs, Joins

Date: Jul-26-2000

Description: Currently, the algebra does not support reference values, such as IDREF, or Keyref (not to be mixed up with "node-references" - see [Issue-0005:  Element identity], which are defined in the XML Query Data Model. The algebra's type system should be extended to support reference types and the data model operators ref, and deref should be supported (similar to id() in XPath).

Issue-0008:  Fixed point operator or recursive functions

Date: Jul-26-2000

Description:  It may be useful to add a fixed-point operator, which can be used in lieu of recursive functions to compute, for example, the transitive closure of a collection.

Currently, the algebra does not guarantee termination of recursive expressions. In order to ensure termination, we might require that a recursive function take one argument that is a singleton element, and any recursive invocation should be on a descendant of that element; since any element has a finite number of descendants, this avoids infinite regress. (Ideally, we should have a simple syntactic rule that enforces this restriction, but we have not yet devised such a rule.)

Impacts optimization; hard to do static type inference; current algebra is first-order

See for the subproblem of typing "//" or desc() http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0187.html (W3C-members only).

Issue-0009:  Externally defined functions

Date: Jul-26-2000

Description:  There is no explicit support for externally defined functions.

The set of built-in functions may be extended to support other important operators.

See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0003.html (W3C-members only).

Issue-0010:  Construct values by copy

Date: Jul-26-2000

Description: Need to be able to construct new types from bits of old types by reference and by copy. Related to [Issue-0005:  Element identity].

Resolution: The WG wishes to support both: construction of values by copy, as well as references to original nodes ( http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)) See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only). This needs some further investigation to sort out all technical difficulties (see [Issue-0062:  Open questions for constructing elements by reference]) so the change has not yet been reflected in the algebra document.

Issue-0011:  XPath tumbler syntax instead of index?

Date: Jul-26-2000

Description: XPath provides as a shorthand syntax [integer] to select child-elements by their position on the sibling axes, whereas the xml-query algebra uses a combination of a built-in function index() and iteration. See http://lists.w3.org/Archives/Member/w3c-archive/2000Sep/0168.html (W3C-members only) for a suggestion to to support indexed iteration in the form "for v sub i in e1 do e2", and to express index() as a function (or macro).

Issue-0012:  GroupBy - needs second order functions?

Date: Jul-26-2000

Description: The type system is currently first order: it does not support function types nor higher-order functions. Higher-order functions are useful for specifying, for example, sorting and grouping operators, which take other functions as arguments.

Resolution: The WG has decided to express groupBy by a combination of for and distinct (see also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) and [Issue-0042:  GroupBy]):. Thus w.r.t. to GroupBy this Issue is resolved. Because GroupBy is not the only use case for higher order functions, a new issue [Issue-0063:  Do we need (user defined) higher order functions?] is raised.

Issue-0013:  Collations

Date: Jul-26-2000

Description: Collations identify the ordering to be applied for sorting strings. Currently, it is considered to have an (optional parameter) collation "name" as follows: "SORT variable IN exp BY +(expression {ASCENDING|DESCENDING} {COLLATION name}) (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)). An alternative would be to model a collation as a simple type derived from string, and use type-level casting, i.e. expression :collationtype (which is already supported in the XML Query Algebra), for specifying the collation. That would make: "SORT variable IN exp BY +(expression:collationname {ASCENDING|DESCENDING}). But that requires some support from XML-Schema.

More generally, collations are important for any operator in the XML Query Algebra that involves string comparison, among them: sort, distinct, "=" and "<".

Issue-0014:  Polymorphic types

Date: Jul-26-2000

Description: The type system is currently monomorphic: it does not permit the definition of a function over generalized types. Polymorphic functions are useful for factoring equivalent functions, each of which operate on a fixed type.

The current type system has already a built-in polymorphic type (lists) and is likely to have more (unordered collections). The question is, whether to allow for user-defined polymorphic types and user defined polymorphic functions.

See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0111.html (W3C-members only).

Issue-0015:  3-valued logic to support NULLs

Date: Jul-26-2000

Issue-0016:  Mixed content

Date: Jul-26-2000

Description: The XML-Query Algebra allows to generate elements with an arbitrary mixture of data (of simple type) and elements. XML-Schema only allows for a combination of strings interspersed with elements (aka mixed content). We need to figure out whether and how to constrain the XML-Query Algebra accordingly (e.g. by typing rules?)

Issue-0017:  Unordered content

Date: Jul-26-2000

Description: All-groups in XML-Schema, not to be mixed up with [Issue-0049:  Unordered Collections]

Resolution: The type system has been extended with the support of all-groups - see [3.4 Types : abstract syntax].

Issue-0018:  Align algebra types with schema

Date: Jul-26-2000

Description:  The algebra's internal type system is the type system of XDuce. A potentially significant problem is that the algebra's types may lose information when converted into XML Schema types, for example, when a result is serialized into an XML document and XML Schema.

This issue comprises also issues [Issue-0016:  Mixed content], [Issue-0017:  Unordered content], [Issue-0053:  Global vs. local elements], [Issue-0054:  Global vs. local complex types], [Issue-0019:  Support derived types], substitution groups.

Issue-0019:  Support derived types

Date: Jul-26-2000

Description: The current type system does not support user defined type hierarchies (by extension or by restriction).

Issue-0020:  Structural vs. name equivalence

Date: Jul-26-2000

Description: The subtyping rules in [5.1 Relating data to types] only define structural subtyping. We need to extend this with support for subtyping via user defined type hierarchies - this is related to [Issue-0019:  Support derived types].

Issue-0021:  Syntax

Date: Jul-26-2000

Description: (e.g. for.<-.in vs for.in.do)

Resolution: The WG has voted for several syntax changes (see also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), [3.6 Expressions]: "for v in e do e", "let v = e do", "sort v in e by e ...", "distinct", "match case v:t e ... else e".

Issue-0022:  Indentation, Whitespaces

Date: Jul-26-2000

Description: Is indentation significant?

Resolution: The WG has consensus that indentation is not significant (see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0043.html (W3C-members only)), i.e., all documents are white space normalized.

Issue-0023:  Catch exceptions and process in algebra?

Date: Jul-26-2000

Description: Does the algebra give explicit support for catching exceptions and processing them?

Resolution: Subsumed by new issue [Issue-0064:  Error code handling in Query Algebra].

Issue-0024:  Value for empty forests

Date: Jul-26-2000

Description: What does "value" do with empty forests?

Resolution: The definition of value(e) has changed to:

value(e) = match children(e)
                     case v: AnyScalar do v
                     else()

Furthermore, the typing rules for "for v in e1 do e2" have been changed such that the variable v is typed-checked seperately for each unit-type occuring in expression e1.

Consequently the example in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html (W3C-members only) would be typed as follows:

query for b in b0/book do
      value(b/year): Integer{0,*}

rather than leading to an error.

Issue-0025:  Treatment of empty results at type level

Date: Jul-26-2000

Description: According to http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html (W3C-members only) this is closely related to [Issue-0024:  Value for empty forests].

Resolution: Resolved by resolution of [Issue-0025:  Treatment of empty results at type level].

Issue-0026:  Project - one tag only

Date: Jul-26-2000

Description: Project is only parameterized by one tag. How can we translate a0/(b | c)?

Resolution: With the new syntax (and type system) a0/(b | c) can be translated to "for v in a0 do match case v1:b[AnyType] do v1 case v2:c[AnyType] do c else ()" - see also [4.1 Relating projection to iteration].

Issue-0027:  Case syntax

Date: Jul-26-2000

Description: N-ary case can be realized by nested binary cases. For design alternatives of case see: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0017.html (W3C-members only)

Resolution: New (n-ary) case syntax is introduced in [3.6 Expressions].

Issue-0028:  Fusion

Date: Jul-26-2000

Description: Does the algebra support fusion as introduced by query languages such as LOREL? This is related to [Issue-0005:  Element identity], because fusion only makes sense with support of element identity.

Issue-0029:  Views

Date: Jul-26-2000

Description: One of the problems in views: Can we undeclare/hide things in environment? For example, if we support element-identity, can we explicitly discard a parent, and/or children from an element in the result-set? Related to [Issue-0005:  Element identity]. See also description in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0047.html (W3C-members only).

Issue-0030:  Automatic type coercion

Date: Jul-26-2000

Description: What do we do if a value does not have a type or a different type from what is required? See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0071.html (W3C-members only). This link also contains a recommendation, which has been agreed as the general direction to go in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only):

Suggested Resolution: We believe that the XML Query Language should specify default type coercions for mixed mode arithmetic should be performed according to a fixed precedence hierarchy of types, specifically integer to fixed decimal, fixed decimal to float, float to double. This policy has the advantage of simplicity, tradition, and static type inference. Programmers could explicitly specify alternative type coercions when desirable.

Issue-0031:  Recursive functions

Date: Jul-26-2000

Resolution: subsumed by [Issue-0008:  Fixed point operator or recursive functions]

Issue-0032:  Full regular path expressions

Date: Jul-26-2000

Description: Full regular path expressions allow to constrain recursive navigation along paths by means of regular expressions, e.g. a/b*/c denotes all paths starting with an a, proceeding with arbitrarily many b's and ending in a c. Currently the XML-Query Algebra can express this by means of (structurally) recursive functions. An alternative may be the introduction of a fixpoint operator [Issue-0008:  Fixed point operator or recursive functions].

Issue-0033:  Metadata Queries

Date: Jul-26-2000

Description: Metadata queries are queries that require runtime access to type information. See also discussion starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0087.html (W3C-members only).

Issue-0034:  Fusion

Date: Jul-26-2000

Resolution: Identical with [Issue-0028:  Fusion]

Issue-0035:  Exception handling

Date: Jul-26-2000

Resolution: Subsumed by [Issue-0023:  Catch exceptions and process in algebra?] and [Issue-0064:  Error code handling in Query Algebra].

Issue-0036:  Global-order based operators

Date: Jul-26-2000

Resolution: Subsumed by [Issue-0003:  Global Order]

Issue-0037:  Copy vs identity semantics

Date: Jul-26-2000

Resolution: subsumed by [Issue-0005:  Element identity]

Issue-0038:  Copy by reachability

Date: Jul-26-2000

Description: Is it possible to copy children as well as IDREFs, Links, etc.? Related to [Issue-0005:  Element identity] and [Issue-0008:  Fixed point operator or recursive functions]

Issue-0039:  Dereferencing semantics

Date: Jul-26-2000

Resolution: subsumed by [Issue-0005:  Element identity]

[Issue-0040:  Case Syntax] through [Issue-0047:  Attributes]  are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0010.html (W3C-members only)

Issue-0040:  Case Syntax

Date: Aug-01-2000

Description: We suggest that the syntax for "case" be made more regular. At present, it takes only two branches, the first labelled with a tag-name and the second labelled with a variable. A more traditional syntax for "case" would have multiple branches and label them in a uniform way. If the algebra is intended only for semantic specification, "case" may not even be necessary.

Resolution: subsumed by [Issue-0027:  Case syntax]

Issue-0041:  Sorting

Date: Aug-01-2000

Description: We are not happy about the three-step sorting process in the algebra. We would prefer a one-step sorting operator such as the one illustrated below, which handles multiple sort keys and mixed sorting directions: SORT emp <- employees BY emp/deptno ASCENDING emp/salary DESCENDING

Resolution: The WG has decided to go for the above syntax, with an (optional) indication of COLLATION. (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), [2.10 Sorting]).

Issue-0042:  GroupBy

Date: Aug-01-2000

Description: We do not think the algebra needs an explicit grouping operator. Quilt and other high-level languages perform grouping by nested iteration. The algebra can do the same.

related to [Issue-0012:  GroupBy - needs second order functions?]

Resolution: The WG has decided (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)) to skip groupBy for the time being (see also revised [2.8 Restructuring and grouping] and raise [Issue-0069:  Organization of Document] for a possible future revision of this resolution.

Issue-0043:  Recursive Descent for XPath

Date: Aug-01-2000

Description: The very important XPath operator "//" is supported in the algebra only by writing a recursive function. This is adequate for a semantic specification, but if the algebra is intended as an optimizable target language it will need better support for "//" (possibly in the form of a fix-point operator.)

Resolution: Resolved by subsumption under [Issue-0043:  Recursive Descent for XPath] (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)).

Issue-0044:  Keys and IDREF

Date: Aug-01-2000

Description: We think the algebra needs some facility for dereferencing keys and IDREFs (exploiting information in the schema.)

Resolution: Subsumed by [Issue-0007:  References: IDREFS, Keyrefs, Joins]

Issue-0045:  Global Order

Date: Aug-01-2000

Description: We are concerned about absence of support for operators based on global document ordering such as BEFORE and AFTER.

Resolution: subsumed by [Issue-0003:  Global Order]

Issue-0046:  FOR Syntax

Date: Aug-01-2000

Description: We agree with comments made in the face-to-face meeting about the aesthetics of the algebra's syntax for iteration. For example, the following syntax is relatively easy to understand: FOR x IN some_expr EVAL f(x) whereas we find the current algebra equivalent to be confusing and misleading: FOR x <- some_expr IN f(x) This syntax appears to assign the result of some_expr to variable x, and uses the word IN in a non-intuitive way.

Resolution: subsumed by [Issue-0021:  Syntax]

Issue-0047:  Attributes

Date: Aug-01-2000

Description: See [Issue-0001:  Attributes].

Resolution: subsumed by [Issue-0001:  Attributes]

[Issue-0048:  Explicit Type Declarations] through [Issue-0050:  Recursive Descent for XPath]  are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0148.html (W3C-members only)

Issue-0048:  Explicit Type Declarations

Date: Jul-27-2000

Description: Type Declaration for the results of a query: The issue is whether to auto construct the result type from a query or to pre-declare the type of the result from a query and check for correct type on the return value. Suggestion: Support for pre-declared result data type and as well as to coerce the output to a new type is desirable. Runtime or compile time type checking is to be resolved? Once you attach a name to a type, it is preserved during the query processing.

Resolution: W.r.t. compile time type casts this is already possible with e:t (see [3.6 Expressions]). For run-time casts an issue has been raised in [Issue-0062:  Open questions for constructing elements by reference].

Issue-0049:  Unordered Collections

Date: Jul-27-2000

Description: Currently, all forests in the data model are ordered. It may be useful to have unordered forests. The distinct operator, for example, produces an inherently unordered forest. Unordered forests can benefit from many optimizations for the relational algebra, such as commutable joins.

Handling of collection of attributes is easy but the collection of elements is complex due to complex type support for the elements. It makes sense to allow casting from unordered to ordered collection and vice versa. It is not clear whether the new ordered or unordered collection is a new type or not. It affects function resolution, optimization.

See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0135.html (W3C-members only).

Our request to Schema to represent insignificance of ordering at schema level has not been fulfilled - see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0136.html (W3C-members only). Thus we need to be aware that this information may get lost, when mapping to schema.

Issue-0050:  Recursive Descent for XPath

Date: Jul-27-2000

Description: Suggestion: The group likes to add a support for fixed-point operator in the query language that will allow us to express the semantics of the // operator in an xpath expression. A path expression of the form a//b may be represented by a fixed-point operator fp(a, "/.")/b.

Resolution: subsumed by [Issue-0043:  Recursive Descent for XPath]

Issue-0051:  Project redundant?

Date: Aug-05-2000

Description: It appears that project a e could be reduced to sth. like

for v <- e in
  case  v of a[v1] => a[v1]
           | v2 => ()

... or would that generate a less precise type?

Resolution: With the new type system and handling of the for operator, project is indeed redundant. See [4.1 Relating projection to iteration].

Issue-0052:  Axes of XPath

Date: Aug-05-2000

Description: The current algebra makes navigation to parents difficult to impossible. With support of Element Identity [Issue-0005:  Element identity] and recursive functions [Issue-0008:  Fixed point operator or recursive functions] one can express parent() by a recursive function via the document root. More direct support needs to be investigated w.r.t its effect on the type system.

The WG wishes to support a built-in operator parent() (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only)). For the current state of affairs see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0074.html (W3C-members only). For some use-cases see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0011.html (W3C-members only).

Issue-0053:  Global vs. local elements

Date: Aug-05-2000

Description: The current type system cannot represent global element-declarations of XML-Schema. All element declarations are local.

Issue-0054:  Global vs. local complex types

Date: Aug-05-2000

Description: The current type system does not distinguish between global and local types as XML-Schema does. All types appear to be fully nested (i.e. local types)

Issue-0055:  Types with non-wellformed instances

Date: Aug-05-2000

Description: The type system and algebra allows for sequences of simple types, which can usually be not represented as a well-formed document. How shall we constrain this? Related to [Issue-0016:  Mixed content].

Issue-0056:  Operators on Simple Types

Date: Jul-15-2000

Description:  We intentionally did not define equality or relational operators on element and atomic type. These operators should be defined by consensus.

See also first designs for support of arithmetic operators http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0138.html (W3C-members only) and for support of operators for date/time http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0113.html (W3C-members only).

Issue-0057:  More precise type system; choice in path

Date: Aug-07-2000

Description: (This subsumes [Issue-0051:  Project redundant?]). If the type system were more precise, then (project a e) could be replaced by:

for v <- e in
    case v of
      a[v1] => a[v1]
    | v2 => ()

One could also represent (e/(a|b)) directly in a similar style.

for v <- e in
    case v of
      a[v1] => a[v1]
    | v2 => case v2 of
	      b[v3] => b[v3]
            | v4 => ()

Currently, there is no way to represent (e/(a|b)) without loss of precision, so if we do not change the type system, we may need to have some way to represent (e/(a|b)) and similar terms without losing precision. (The LA team has a design for this more precise type system, but it is too large to fit in the margin of this web page!)

Resolution: See resolution of [Issue-0051:  Project redundant?]

Issue-0058:  Downward Navigation only?

Date: Aug-07-2000

Description: Related to [Issue-0052:  Axes of XPath]. The current type system (and the more precise system alluded to in [Issue-0057:  More precise type system; choice in path]) seems well suited for handling XPath children and descendant axes, but not parent, ancestor, sibling, preceding, or following axes. Is this limitation one we can live with?

Resolution: Subsumed by [Issue-0052:  Axes of XPath]

Issue-0059:  Testing Subtyping

Date: Aug-07-2000

Description: One operation required in the algebra is to test whether XML type t1 is a subtype of XML type t2, indicated by writing t1 <: t2. There is a well-known algorithm for this, based on tree automata, which is a straightforward variant of the well-known algorithm for testing whether the language generated by one regular-expression is a subset of the language generated by another. (The algorithm involves generating deterministic automata for both regular expressions or types.)

However, the naive implementation of the algorithm for comparing XML types can be slow in practice, whereas the naive algorithm for regular expressions is tolerably fast. The only acceptably fast implementation of a comparison for XML types that the LA team knows of has been implemented by Haruo Hasoya, Jerome Voullion, and Benjamin Pierce at the University of Pennsylvania, for their implementation of Xduce. (Our implementation of the XML query algebra re-uses their code, with permission.)

So, should we adopt a simpler definition of subtyping which is easier to test? One possibility is to adopt the sibling restriction from Schema, which requires that any two elements which appear a siblings in the same content model must themselves have contents of the same type. Jerome Simeon and Philip Wadler discovered that adopting the sibling restriction reduces the problem of checking subtyping of XML types to that of checking regular languages for inclusion, so it may be worth adopting the restriction for that reason.

Issue-0060:  Internationalization aspects for strings

Date: Jun-26-2000

Description: These issues are taken from the comments on the Requirements Document by I18N (http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jun/0137.html (W3C-members only)).

Further information can be found at http://www.w3.org/TR/WD-charreq.

It is a goal of i18n that queries involving string matching ("select x where x='some_constant'") treat canonically equivalent strings (in the Unicode sense) as matching. If the query and the target are both XML, early normalization (as per the Character Model) is assumed and binary comparison ensures that the equivalence requirement is satisfied. However, if the target is originally a legacy database which logically has a layer that exports the data as XML, that XML must be exported in normalized form. The XML Query spec must impose the normalization requirement upon such layers.

Similarly, the query may come from a user-interface layer that creates the XML query. The XML Query spec must impose the normalization requirement upon such layers.

Provided that the query and the target are in normalized form C, the output of the query must itself be in normalized form C.

Queries involving string matching should support various kinds of loose matching (such as case-insensitivity, katakana-hiragana equivalence, accent-accentless equivalence, etc.)

If such features as case-insensitivity are present in queries involving string matching, these features must be properly internationalized (e.g. case folding works for accented letters) and language-dependence must be taken into account (e.g. Turkish dotless-i).

Queries involving character counting and indexing must take into account the Character Model. Specifically, they should follow Layer 3 (locale-independent graphemes). Additional details can be found in The Unicode Standard 3.0 and UTR#18. Queries involving word counting and indexing should similarly follow the recommendations in these references.

Issue-0061:  Model for References

Date: Aug-16-2000

Description: Raised in: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0063.html (W3C-members only). Related to a number of issues around [Issue-0005:  Element identity].

The following issues have been raised since Sep-25-2000.

Issue-0062:  Open questions for constructing elements by reference

Date: Sep-25-2000

Description: (1) What is the value of parent() when constructing new elements with children refering to original nodes? See also discussion at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only).

(2) Is an approach to either make copies for all children or provide references to all children, or should we allow for a more flexible combination of copies and references?

Issue-0063:  Do we need (user defined) higher order functions?

Date: Oct-16-2000

Description: The current XML-Query-Algebra does not allow functions to be parameters of another function - so called higher order functions. However, most of the algebra operators are (built-in) higher functions, taking expressions as an argument ("sort", "for", "case" to name a few). Even a fixpoint operator, "fun f(x)=e, fix f(x) in e" (see also [Issue-0008:  Fixed point operator or recursive functions]), would be a built-in higher order function.

Resolution: As agreed in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) the XML Query Algebra will not support user defined higher order functions. It does support a number of built-in higher order functions.

Issue-0064:  Error code handling in Query Algebra

Date: Oct-04-2000

Description: How do we return an error code from a function defined in current Query algebra. Do we need to create an array (or a structure) to merge the return value and error code to do this. If that is true, it may be inefficient to implement. In order for cleaner and efficient implementation, it may be necessary to allow a function declaration to take a parameter of type "output" and allow it to return an error code as part of the function definition. See also thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0051.html (W3C-members only).

Resolution:  One does not need to create a structure to combine return values with error codes, provided each operator or function /either/ returns a value /or/ raises an error. The XML-Query Algebra supports means to raise errors, but does not define standard means to catch errors. Raising errors is accomplished by the expression "error" of type Ø (empty choice). Because Ø | t = t, such runtype errors do not influence static typing. The surface syntax and/or detailed specification of operators on simple types (see [Issue-0056:  Operators on Simple Types]) may choose to differentiate errors into several error-codes.

Issue-0065:  Built-In GroupBy?

Date: Oct-16-2000

Description: As discussed in http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), we may revisit the resolution of [Issue-0042:  GroupBy] and reintroduce GroupBy along the lines of sort: "group v in e1 by [e2 {collation}]". One reason for this may be that this allows to use collation for deciding about the equality of strings.

Resolution: In http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) the WG has decided to close this issue, and for the time being not consider GroupBy as a built-in operator. Furthermore, [Issue-0013:  Collations] is ammended to deal with collations for all operators involving a comparison of strings.

Issue-0066:  Shallow or Deep Equality?

Date: Oct-16-2000

Description: What is the meaning of "=" and "distinct"? Equality of references to nodes or deep equality of data?

Issue-0067:  Runtime Casts

Date: Sep-21-2000

Description: In some contexts it may be desirable to cast values at runtime. Such runtime casts lead to an error if a value cannot be cast to a given type. See also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only), where the Algebra team has been put in charge of introducing run-time casts into the Algebra.

Resolution: cast e : t has been introduced as a reducible operator expressed in terms of match (see [4.2 Reducible Expressions]).

Issue-0068:  Document Collections

Date: Oct-16-2000

Description: Per our requirements document we are chartered to support document collections. The current XML-Query Algebra deals with single documents only. There are a number of subissues:

(a) Do we need a more elaborate notion of node-references? E.g. pair of (URI of root-node, local node-ref)

(b) Does the namespace mechanism suffice to type collections of nodes from different documents? Probably yes.

(c) Provided (a) and (b) can be settled, will the approach taken for [Issue-0049:  Unordered Collections] do the rest?

Issue-0069:  Organization of Document

Date: Oct-16-2000

Description: The current document belongs more to the genre (scientific) paper than to the genre specification. One may consider the following modifications: (a) reorganize intro to give a short overview and then state the purpose (strongly typed, neutral syntax with formal semantics as a basis for possibly multiple syntaxes, etc.) (compared to version Aug-23, this version has already gone a good deal in this direction). (b) Equip various definitions and type rules with id's. (c) Elaborate appendices on mapping XML-Query-Algebra Model vs. XML-Query-Datamodel, XML-Query-Type System vs. XML-Schema-Type System. (d) Maybe add an appendix on use-case-solutions. The problem is of course: Part of this is a lot of work, and we may not achieve all for the first release.

Resolution: At http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) the WG decided to dispose of this issue. The current overall organization of the document is quite adequate, but of course editorial decisions will have to made all the time.

Issue-0070:  Stable vs. Unstable Sort/Distinct

Date: Oct-02-2000

Description: Should sort (and distinct) be stable on ordered collections, i.e. lists, and unstable on unordered collections (see [Issue-0049:  Unordered Collections])? For more details see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0007.html (W3C-members only).

Issue-0071:  Alignment with the XML Query Datamodel

Date: Sep-26-2000

Description: Currently, the XML Query Algebra Datamodel does not model PI's and comments. For more details see thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0167.html (W3C-members only).

Issue-0072:  Facet value access in Query Algebra

Date: Oct-04-2000

Description: Each of the date-time data types have facet values as defined by the schema data types draft spec. This problem is general enough to be applied to other atomic data types.

The question is : Should we provide access to these facet values on an instance of a particular data types? If so, what type of access? My take is the facets are to be treated like read-only attributes of a data instance and one should have a read access to them. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0044.html (W3C-members only).

Issue-0073:  Facets for simple types and their role for typechecking

Date: Oct-16-2000

Description: XML-Schema introduces a number of constraining facets http://www.w3.org/TR/xmlschema-2/#dc-facets for simple types (among them: length, pattern, enumeration, ...). We need to figure out whether and how to use these constraining facets for type-checking. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0146.html (W3C-members only).

Issue-0074:  Operational semantics for expressions

Date: Nov-16-2000

Description: It is necessary to add an operational semantics that formally defines each operator in the Algebra.

Issue-0075:  Overloading user defined functions

Date: Nov-17-2000

Description: User defined functions can not be overloaded in the XML Query Algebra, i.e., a function is exclusively identified by its name, and not by its signature. Should this restriction be relaxed and if so - to which extent?

B.3 Alphabetic list of issues

B.3.1 Open Issues

Number: 36

[Issue-0015:  3-valued logic to support NULLs]
[Issue-0018:  Align algebra types with schema]
[Issue-0071:  Alignment with the XML Query Datamodel]
[Issue-0030:  Automatic type coercion]
[Issue-0052:  Axes of XPath]
[Issue-0013:  Collations]
[Issue-0038:  Copy by reachability]
[Issue-0068:  Document Collections]
[Issue-0009:  Externally defined functions]
[Issue-0073:  Facets for simple types and their role for typechecking]
[Issue-0072:  Facet value access in Query Algebra]
[Issue-0008:  Fixed point operator or recursive functions]
[Issue-0032:  Full regular path expressions]
[Issue-0028:  Fusion]
[Issue-0003:  Global Order]
[Issue-0054:  Global vs. local complex types]
[Issue-0053:  Global vs. local elements]
[Issue-0060:  Internationalization aspects for strings]
[Issue-0033:  Metadata Queries]
[Issue-0016:  Mixed content]
[Issue-0061:  Model for References]
[Issue-0062:  Open questions for constructing elements by reference]
[Issue-0074:  Operational semantics for expressions]
[Issue-0056:  Operators on Simple Types]
[Issue-0075:  Overloading user defined functions]
[Issue-0014:  Polymorphic types]
[Issue-0007:  References: IDREFS, Keyrefs, Joins]
[Issue-0066:  Shallow or Deep Equality?]
[Issue-0070:  Stable vs. Unstable Sort/Distinct]
[Issue-0020:  Structural vs. name equivalence]
[Issue-0019:  Support derived types]
[Issue-0059:  Testing Subtyping]
[Issue-0055:  Types with non-wellformed instances]
[Issue-0049:  Unordered Collections]
[Issue-0029:  Views]
[Issue-0011:  XPath tumbler syntax instead of index?]

B.3.2 Resolved (or redundant) Issues

Number: 39

[Issue-0001:  Attributes]
[Issue-0047:  Attributes]
[Issue-0065:  Built-In GroupBy?]
[Issue-0027:  Case syntax]
[Issue-0040:  Case Syntax]
[Issue-0023:  Catch exceptions and process in algebra?]
[Issue-0010:  Construct values by copy]
[Issue-0037:  Copy vs identity semantics]
[Issue-0039:  Dereferencing semantics]
[Issue-0063:  Do we need (user defined) higher order functions?]
[Issue-0058:  Downward Navigation only?]
[Issue-0005:  Element identity]
[Issue-0064:  Error code handling in Query Algebra]
[Issue-0035:  Exception handling]
[Issue-0048:  Explicit Type Declarations]
[Issue-0046:  FOR Syntax]
[Issue-0034:  Fusion]
[Issue-0045:  Global Order]
[Issue-0036:  Global-order based operators]
[Issue-0012:  GroupBy - needs second order functions?]
[Issue-0042:  GroupBy]
[Issue-0022:  Indentation, Whitespaces]
[Issue-0044:  Keys and IDREF]
[Issue-0057:  More precise type system; choice in path]
[Issue-0002:  Namespaces]
[Issue-0069:  Organization of Document]
[Issue-0026:  Project - one tag only]
[Issue-0051:  Project redundant?]
[Issue-0043:  Recursive Descent for XPath]
[Issue-0050:  Recursive Descent for XPath]
[Issue-0031:  Recursive functions]
[Issue-0004:  References vs containment]
[Issue-0067:  Runtime Casts]
[Issue-0041:  Sorting]
[Issue-0006:  Source and join syntax instead of "for"]
[Issue-0021:  Syntax]
[Issue-0025:  Treatment of empty results at type level]
[Issue-0017:  Unordered content]
[Issue-0024:  Value for empty forests]