The W3C has published an initial public Working Draft of XQuery 1.0 and XPath 2.0 Full-Text. Created as a joint specification by the W3C XML Query Working Group and the XSL Working Group as part of the XML Activity, the new draft defines a language that extends XQuery 1.0 and XPath 2.0 with full-text search capabilities.
As defined by the draft, "full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces." The new full-text search facility is implemented by extending the XQuery and XPath languages to support a new "FTContainsExpr" expression and a new "ft:score" function.
Expressions of the type FTSelection are composed of: (1) words or combinations of words that are the search strings to be found as matches; (2) match options such as case sensitivity or an indication to use stop words; (3) Boolean operators that allow composition of an FTSelection from simpler FTSelections; (4) positional constraints such as an indication of match distance or window.
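The four ingredients listed above can be pictured as small composable predicates. The following Python sketch is only an illustration under stated assumptions: it is not the draft's grammar or syntax, and every name in it (Selection, word, all_of, any_of, within_window) is hypothetical.

```python
import re
from typing import Callable, List

Tokens = List[str]
# Hypothetical stand-in for an FTSelection: a predicate over a token list.
Selection = Callable[[Tokens], bool]


def tokenize(text: str) -> Tokens:
    """Split text into word tokens with a deliberately naive tokenizer."""
    return re.findall(r"\w+", text)


def word(term: str, case_sensitive: bool = False) -> Selection:
    """(1) A search word, with (2) a case-sensitivity match option."""
    def match(tokens: Tokens) -> bool:
        if case_sensitive:
            return term in tokens
        return term.lower() in [t.lower() for t in tokens]
    return match


def all_of(*parts: Selection) -> Selection:
    """(3) Boolean AND over simpler selections."""
    return lambda tokens: all(part(tokens) for part in parts)


def any_of(*parts: Selection) -> Selection:
    """(3) Boolean OR over simpler selections."""
    return lambda tokens: any(part(tokens) for part in parts)


def within_window(first: str, second: str, size: int) -> Selection:
    """(4) Positional constraint: both words fit inside `size` consecutive tokens."""
    def match(tokens: Tokens) -> bool:
        lowered = [t.lower() for t in tokens]
        a = [i for i, t in enumerate(lowered) if t == first.lower()]
        b = [i for i, t in enumerate(lowered) if t == second.lower()]
        return any(abs(i - j) + 1 <= size for i in a for j in b)
    return match


tokens = tokenize("The XML Query Working Group published a draft")
selection = all_of(word("xml"), within_window("xml", "draft", 8))
print(selection(tokens))
```

Composing selections from simpler ones keeps each constraint independent, which is the property item (3) describes.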
The new Full-Text Working Draft endeavors to meet search requirements specified in an updated companion draft XQuery 1.0 and XPath 2.0 Full-Text Use Cases. This document provides use cases designed to "illustrate important applications of full-text querying within an XML query language. Each use case exercises a specific functionality relevant to full-text querying. An XML Schema and sample input data are provided; each use case specifies a query applied to the input data, a solution in XQuery, a solution in XPath (when possible), and the expected results."
Full-text query, designed as an extension of XQuery and XPath, will support several kinds of searches that are not possible with simple substring matching. It allows precision querying of XML documents containing "highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags)."
Language-based and token-based searches are also supported. For example, a query can find all the news items that contain a word with the same linguistic stem as the English word "mouse", which matches occurrences of both "mouse" and "mice" together with their possessive forms.
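As a rough illustration of stem-based matching, the Python sketch below compares tokens after reducing them to a common stem. It assumes a toy lookup table (IRREGULAR) and a simple possessive-stripping rule; both are hypothetical stand-ins, since the draft leaves stemming to the implementation.

```python
import re

# Hypothetical table of irregular forms; a real stemmer would be far larger.
IRREGULAR = {"mice": "mouse"}


def stem(token: str) -> str:
    """Lower-case a token, drop a trailing possessive 's, and map irregular forms."""
    t = token.lower()
    if t.endswith("'s"):
        t = t[:-2]
    return IRREGULAR.get(t, t)


def contains_stem(text: str, query_word: str) -> bool:
    """True if any token in `text` shares a stem with `query_word`."""
    tokens = re.findall(r"[A-Za-z]+(?:'s)?", text)
    target = stem(query_word)
    return any(stem(t) == target for t in tokens)


print(contains_stem("Three blind mice ran", "mouse"))
print(contains_stem("The mouse's tail", "mouse"))
print(contains_stem("The mousetrap was empty", "mouse"))
```

The last call shows the contrast with substring matching: "mousetrap" contains the string "mouse" but does not share its stem in this toy model.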
Tokenization serves as the basis for full-text search in the W3C draft. Words, spaces, and punctuation are distinguished. A "word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried; consecutive words need not be separated by either punctuation or space, and words may overlap; a phrase is a sequence of ordered words which can contain any number of words." This model "enables functions and operators which work with the relative positions of words (e.g., proximity operators). It uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word, e.g., wildcards, stemming."
The W3C XQuery and XSL Working Groups invite public comment on the two full-text query drafts.
Bibliographic Information
XQuery 1.0 and XPath 2.0 Full-Text. W3C Working Draft. 09-July-2004. Edited by Sihem Amer-Yahia (AT&T Labs - Research), Chavdar Botev (Invited Expert), Stephen Buxton (Oracle Corporation), Pat Case (Library of Congress), Jochen Doerre (Invited Expert, IBM), Darin McBeath (Elsevier), Michael Rys (Microsoft), Jayavel Shanmugasundaram (Invited Expert). Version URL: http://www.w3.org/TR/2004/WD-xquery-full-text-20040709/. Latest version URL: http://www.w3.org/TR/xquery-full-text/.
XQuery 1.0 and XPath 2.0 Full-Text Use Cases. W3C Working Draft. 09-July-2004. Edited by Sihem Amer-Yahia (AT&T Labs - Research) and Pat Case (Library of Congress). Version URL: http://www.w3.org/TR/2004/WD-xmlquery-full-text-use-cases-20040709/. Latest version URL: http://www.w3.org/TR/xmlquery-full-text-use-cases. Previous version URL: http://www.w3.org/TR/2003/WD-xmlquery-full-text-use-cases-20030214/.
Introduction to Full-Text Search and XML
According to the new W3C Working Draft, "full-text search is different from [simple, brute-force] substring search in many ways:
A full-text search searches for phrases (a sequence of words) rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the phrase "lease" will not.
There is an expectation that a full-text search will support language- and token-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a word with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). An example of a token-based search is "find me all the news items that contain the word 'XML' within 3 words (tokens) of 'Query'".
Full-text search is subject to the vagaries and nuances of language. The results it returns are often of varying usefulness. When you search a web site for all cameras that cost less than $100, this is an exact search. There is a set of cameras that match this search, and a set that do not. Similarly, when you do a string search across news items for "mouse", there is only one expected result set. When you do a full-text search for, say, all the news items that contain the word "mouse", you probably expect to find news items with the word "mice", and possibly "rodents" (or possibly "computers"!). But not all results are equal: some results are more "mousey" than others. Because full-text search can be inexact, we have the notion of score or relevance: we generally expect to see the most relevant results at the top of the results list. Of course, relevance is in the eye of the beholder. Note: as XQuery/XPath evolves, it may apply the notion of score to querying structured search. For example, when making travel plans or shopping for cameras, it is sometimes more useful to get an ordered list of near-matches. If XQuery/XPath defines a generalized inexact match, we assume that XQuery/XPath can utilize the scoring framework provided by the full-text language.
As XML becomes mainstream, users expect to be able to store and search all their documents in XML. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT standard. SQL/MM-FT defines extensions to SQL for expressing full-text queries, providing functionality similar to that of this full-text language extension to XQuery 1.0/XPath 2.0.
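The first two differences in the list above can be made concrete with a small Python sketch. It assumes a naive word tokenizer and reads "within 3 words" as "token positions differ by at most 3"; that reading, and the function names, are assumptions for illustration rather than the draft's definition.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-cased word tokens from a deliberately naive tokenizer."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def substring_match(text: str, needle: str) -> bool:
    """Brute-force substring search over the raw characters."""
    return needle.lower() in text.lower()


def word_match(text: str, query_word: str) -> bool:
    """Token-based search: the query must equal a whole token."""
    return query_word.lower() in tokenize(text)


def within_distance(text: str, first: str, second: str, max_words: int) -> bool:
    """True if the two words occur at most `max_words` token positions apart."""
    tokens = tokenize(text)
    a = [i for i, t in enumerate(tokens) if t == first.lower()]
    b = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) <= max_words for i in a for j in b)


news = "Foobar Corporation releases the 20.9 version"
print(substring_match(news, "lease"))   # the string occurs inside "releases"
print(word_match(news, "lease"))        # but no token equals "lease"
print(within_distance("The XML Query Working Group", "XML", "Query", 3))
```

Working on token positions rather than character offsets is what makes the proximity test possible at all; a substring search has no notion of "words apart".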
Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces.
A word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which can contain any number of words.
Tokenization enables functions and operators which work with the relative positions of words (e.g., proximity operators). It also uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming).
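A minimal Python sketch of such a tokenizer follows. It assumes that paragraphs are separated by blank lines and that sentences end with ".", "!" or "?"; these rules and the Token record are hypothetical, since the draft leaves word boundaries implementation defined.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    text: str
    position: int   # index of the word in the whole text
    sentence: int   # index of the sentence containing the word
    paragraph: int  # index of the paragraph containing the word


def tokenize(text: str) -> List[Token]:
    """Return word tokens annotated with word, sentence, and paragraph positions."""
    tokens: List[Token] = []
    position = 0
    sentence = 0
    for p_idx, para in enumerate(re.split(r"\n\s*\n", text.strip())):
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"\w+", sent)
            if not words:
                continue
            for w in words:
                tokens.append(Token(w, position, sentence, p_idx))
                position += 1
            sentence += 1
    return tokens


def same_sentence(tokens: List[Token], first: str, second: str) -> bool:
    """True if both words occur in at least one common sentence."""
    a = {t.sentence for t in tokens if t.text.lower() == first.lower()}
    b = {t.sentence for t in tokens if t.text.lower() == second.lower()}
    return bool(a & b)


sample = "Mice like cheese. Cats like mice.\n\nDogs like bones."
for token in tokenize(sample):
    print(token)
```

Recording the sentence and paragraph index alongside each word position is one simple way to support both proximity operators and "same sentence" or "same paragraph" constraints from a single pass.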
[This working draft uses] the namespace ft (for full-text) that corresponds to the URL http://www.w3.org/2004/07/xquery-full-text and defines the namespace of full-text search... [from Section 1, 'Introduction']
Introduction to Full-Text Use Cases
The use cases documented in XQuery 1.0 and XPath 2.0 Full-Text Use Cases "were created by the XML Query and XSL Working Groups to illustrate important applications of full-text querying within an XML query language. Each use case exercises a specific functionality relevant to full-text querying. An XML Schema and sample input data are provided. Each use case specifies a query applied to the input data, a solution in XQuery, a solution in XPath (when possible), and the expected results..."
These use cases:
Present some possible functions and features for tokenized text support in XQuery and XPath. None are yet available in XQuery or XPath...
Illustrate simple and complex queries. The more complex queries would normally only be constructed by programmers, librarians, and other expert users, or provided for novice users via saved queries and graphical user interfaces. Each query is intended to illustrate a single functionality, although queries might overlap in their functionalities (e.g., phrases and ordered distance queries allowing no intervening words). Overlapping and similar functionalities are noted in the comments on query behavior.
Draw from sample data which are almost entirely in English. Use cases in other languages are solicited, especially where they illustrate language-specific implementations of functions and features. Among the most sought after are use cases for queries using prefix and infix wild cards, proximity queries, and operators and queries requiring functionality which may not have Western language equivalents.
Include queries which in most instances can be written with pure Boolean full-text predicates or with scoring (e.g., scoring on the number of occurrences of a word or phrase, scoring on how close words are to one another within a distance query, scoring on how similar a word is to the one being stemmed). A few, those in Section 17, cannot be written with Boolean full-text predicates. Scoring methodologies will not be defined in this recommendation. Scoring will be implementation defined. Results are provided in document order, except those in Section 17. Results could be returned ordered differently, such as by relevance (based on implementation defined scoring) or explicitly by element...
Query element content. See Section 4 for explicit queries on attribute values.
Include queries which are case-insensitive. When returning a paragraph, the text is returned as it occurs in the data model. This approach was chosen to keep the sample data short and the expected results meaningful. It would have been equally valid to return only the character queried...
Include queries which when they target XML elements are understood, unless otherwise stated, to query text within any text node descendant of the element.
Include queries which return only elements and attributes which meet all the conditions specified in the query. In particular, Boolean queries return results where the Boolean conditions in the query are satisfied, i.e., are used to select what is being returned to users. Query results may be returned in different ways. From a query for books containing the word "usability", users might be interested in returning, for each book containing the word "usability", its number and its entire content. In another situation for the same query, users might be interested in returning, for each book containing the word "usability", its number and only the elements and attributes in the content which contain the word "usability". As in this second situation, the queries in these use cases return only elements and attributes which meet all the conditions specified in the query. The Return clause may also include additional or different elements and attributes if specified, and may construct new elements.
Include queries which provide some of the basic functionality of fuzzy match querying (e.g., wildcards, stemming, thesaurus support, proximity).
Provide highlighting of found words and phrases in the expected results of queries as an aid to users. The presence of highlighting says nothing about whether highlighting will be a feature of XQuery or XPath full-text querying.
Display query solutions in XQuery and, when possible, in XPath. Queries that cannot be written in XPath include those that contain element constructors and those that cannot be written without let and order by clauses.
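The distinction drawn above between results in document order and results ordered by relevance can be sketched in Python. Because the use cases state that scoring is implementation defined, the formula here (occurrences of the word divided by the number of tokens) is purely an assumption, as are the function names.

```python
import re
from typing import List, Tuple


def score(text: str, query_word: str) -> float:
    """Hypothetical score: the share of tokens equal to the query word."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    if not tokens:
        return 0.0
    return tokens.count(query_word.lower()) / len(tokens)


def search(docs: List[str], query_word: str,
           by_relevance: bool = False) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for documents that match."""
    hits = [(i, score(d, query_word)) for i, d in enumerate(docs)]
    hits = [(i, s) for i, s in hits if s > 0]
    if by_relevance:
        # The sort is stable, so equal scores keep document order.
        hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


books = [
    "Usability testing is part of design.",
    "No match here.",
    "Usability, usability, usability!",
]
print(search(books, "usability"))                     # document order
print(search(books, "usability", by_relevance=True))  # highest score first
```

With `by_relevance` left off, the function behaves like a Boolean predicate that keeps matches in document order; switching it on reorders the same matches by score.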
About the W3C XML Query Project
"The mission of the XML Query Project is to provide flexible query facilities to extract data from real and virtual documents on the World Wide Web, therefore finally providing the needed interaction between the Web world and the database world. Ultimately, collections of XML files will be accessed like databases. The ambitious task of the XML Query (XQuery) Working Group is therefore to develop the first world standard for querying web documents, following the incredibly successful discussion started at the QL'98 event. However, the XML Query (XQuery) project is all-around, and also includes in its efforts not only the standard for querying XML documents, but also the next-generation standards for doing XML selection (XPath2), for doing XML serialization, for doing Full-Text Search, for providing a possible functional XML Data Model, and for providing a standard set of functions and operators for manipulating web data..." [from the XQuery page]
Principal references:
- XQuery 1.0 and XPath 2.0 Full-Text. W3C Working Draft. 09-July-2004.
- XQuery 1.0 and XPath 2.0 Full-Text Use Cases. W3C Working Draft. 09-July-2004.
- W3C news item
- Mail Archives for the W3C public list 'www-ql@w3.org'. A public mailing list on query languages, including (but not limited to) discussion on the XML-Query project. Subscribed users may post to this list.
- Mail Archives for W3C public list 'public-qt-comments@w3.org'. This QT (Query and Transform) list is for public feedback on the following W3C specifications published by the XML Query and XSL Working Groups: XQuery 1.0, XSLT 2.0, XPath 2.0, XQuery 1.0 and XPath 2.0 Data Model, XQuery 1.0 and XPath 2.0 Functions and Operators Version 1.0.
- Contact: Massimo Marchiori (W3C Contact for XML Query)
- XML Query Project (XQuery)
- W3C XSL Working Group
- W3C Extensible Markup Language (XML) Activity Statement
- W3C XML Home Page
- "XML and Query Languages" - Main reference page.