The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: July 12, 2004.

W3C Releases Public Working Draft for Full-Text Searching of XML Text and Documents.

W3C has published an initial Public Working Draft for XQuery 1.0 and XPath 2.0 Full-Text. Created as a joint specification by the W3C XML Query Working Group and the XSL Working Group as part of the XML Activity, this new draft specification defines a language that extends XQuery 1.0 and XPath 2.0 with full-text search capabilities.

As defined by the draft, "full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces." The new full-text search facility is implemented by extending the XQuery and XPath languages to support a new "FTContainsExpr" expression and a new "ft:score" function.

Expressions of the type FTSelection are composed of: (1) words or combinations of words that are the search strings to be found as matches; (2) match options such as case sensitivity or an indication to use stop words; (3) Boolean operators that allow composition of an FTSelection from simpler FTSelections; (4) positional constraints such as an indication of match distance or window.
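The four ingredients listed above can be pictured with a small Python sketch. This is only an illustration of how word matches, a case-sensitivity option, Boolean operators, and a window constraint might compose; it is not the draft's syntax or semantics, and the names `word`, `ft_and`, `ft_or`, `ft_not`, and `window` are hypothetical helpers invented for this example.

```python
# Illustrative only: each "selection" is a function from a token list to bool.
# None of these names come from the W3C draft.

def word(w, case_sensitive=False):
    """Selection that matches if the token list contains the word `w`."""
    def match(tokens):
        if case_sensitive:
            return w in tokens
        return w.lower() in [t.lower() for t in tokens]
    return match

def ft_and(*sels):
    return lambda tokens: all(s(tokens) for s in sels)

def ft_or(*sels):
    return lambda tokens: any(s(tokens) for s in sels)

def ft_not(sel):
    return lambda tokens: not sel(tokens)

def window(size, *sels):
    """Matches if all selections hold within some run of `size` consecutive tokens."""
    def match(tokens):
        starts = range(max(1, len(tokens) - size + 1))
        return any(all(s(tokens[i:i + size]) for s in sels) for i in starts)
    return match

tokens = "The XML Query working group published a draft".split()

# "xml" and "query" within a 3-token window, and no exact-case "SGML" anywhere.
selection = ft_and(
    window(3, word("xml"), word("query")),
    ft_not(word("SGML", case_sensitive=True)),
)
result = selection(tokens)
```

Building selections from smaller selections mirrors item (3) above: any composed selection can itself be passed to `ft_and`, `ft_or`, `ft_not`, or `window`.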

The new Full-Text Working Draft endeavors to meet search requirements specified in an updated companion draft XQuery 1.0 and XPath 2.0 Full-Text Use Cases. This document provides use cases designed to "illustrate important applications of full-text querying within an XML query language. Each use case exercises a specific functionality relevant to full-text querying. An XML Schema and sample input data are provided; each use case specifies a query applied to the input data, a solution in XQuery, a solution in XPath (when possible), and the expected results."

Full-text query designed as an extension of XQuery and XPath will support several kinds of searches not possible using simple substring matching. It allows precision querying of XML documents containing "highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags)."

Language-based and token-based searches are also supported; for example, find all the news items that contain a word with the same linguistic stem as the English word "mouse", which finds occurrences of both "mouse" and "mice" together with possessive forms.
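A stem-based match of this kind can be sketched in Python as follows. The lookup table `STEMS` is a hypothetical stand-in for a real linguistic stemmer, which the draft leaves to implementations; the function names are likewise assumptions made for illustration.

```python
# Hypothetical lookup table standing in for a real linguistic stemmer.
STEMS = {"mice": "mouse", "mouse's": "mouse", "mouses": "mouse"}

def stem(word):
    """Map a word to its stem; unknown words are their own stem."""
    w = word.lower()
    return STEMS.get(w, w)

def contains_stem(text, query_word):
    """True if any word in `text` shares a stem with `query_word`."""
    target = stem(query_word)
    # Strip surrounding punctuation but keep apostrophes inside words.
    return any(stem(w.strip('.,;:!?"')) == target for w in text.split())

hit_plural = contains_stem("Three blind mice ran.", "mouse")
hit_possessive = contains_stem("The mouse's tail was long", "mouse")
miss = contains_stem("A house for rent", "mouse")
```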

Tokenization serves as the basis for full-text search in the W3C draft. Words, spaces, and punctuation are distinguished. A "word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried; consecutive words need not be separated by either punctuation or space, and words may overlap; a phrase is a sequence of ordered words which can contain any number of words." This model "enables functions and operators which work with the relative positions of words (e.g., proximity operators). It uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word, e.g., wildcards, stemming."
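As a rough illustration of this model, the Python sketch below splits text into words, spaces, and punctuation and assigns each word a position. The draft leaves tokenization implementation defined, so the regular expression and the function names here are assumptions, not the specification's rules.

```python
import re

# One alternative per token kind; a "word" here is a run of \w characters
# (letters, digits, underscore). This is an assumed rule, not the draft's.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<space>\s+)|(?P<punct>[^\w\s])")

def tokenize(text):
    """Split text into (kind, text) pairs covering every character."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def word_positions(text):
    """Return (position, word) pairs, counting only word tokens from 1."""
    words = [tok for kind, tok in tokenize(text) if kind == "word"]
    return list(enumerate(words, start=1))

tokens = tokenize("XML Query, now!")
positions = word_positions("XML Query, now!")
```

Word positions like these are what proximity operators would compare, for example to check that two words lie within a given number of tokens of each other.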

The W3C XQuery and XSL Working Groups invite public comment on the two full-text query drafts.

Introduction to Full-Text Search and XML

According to the new W3C Working Draft, "full-text search is different from [simple, brute-force] substring search in many ways:

  • A full-text search searches for phrases (a sequence of words) rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the phrase "lease" will not.

  • There is an expectation that a full-text search will support language- and token-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a word with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). An example of a token-based search is "find me all the news items that contain the word 'XML' within 3 words (tokens) of 'Query'".

  • Full-text search is subject to the vagaries and nuances of language. The results it returns are often of varying usefulness. When you search a web site for all cameras that cost less than $100, this is an exact search. There is a set of cameras that match this search, and a set that do not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for, say, all the news items that contain the word "mouse", you probably expect to find news items with the word "mice", and possibly "rodents" (or possibly "computers"!). But not all results are equal: some results are more "mousey" than others. Because full-text search can be inexact, we have the notion of score or relevance: we generally expect to see the most relevant results at the top of the results list. Of course, relevance is in the eye of the beholder. Note: as XQuery/XPath evolves, it may apply the notion of score to querying structured search. For example, when making travel plans or shopping for cameras, it is sometimes more useful to get an ordered list of near-matches. If XQuery/XPath defines a generalized inexact match, we assume that XQuery/XPath can utilize the scoring framework provided by the full-text language.

  • As XML becomes mainstream, users expect to be able to store and search all their documents in XML. This requires a standard way to do full-text search, as well as structured searches, against XML documents. A similar requirement for full-text search led ISO to define the SQL/MM-FT standard. SQL/MM-FT defines extensions to SQL to express full-text queries providing similar functionality as this full-text language extension to XQuery 1.0/XPath 2.0 does.

  • Full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces.

  • A word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which can contain any number of words.

  • Tokenization enables functions and operators which work with the relative positions of words (e.g., proximity operators). It also uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word (e.g., wildcards, stemming).

  • [This working draft uses] the namespace ft (for full-text) that corresponds to the URL http://www.w3.org/2004/07/xquery-full-text and defines the namespace of full-text search... [from the Section 1 'Introduction']
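The contrast between substring search and word-based search, and the idea of ordering results by score, can be sketched in Python. The sample news items extend the draft's "lease" example with two invented items, and the score used here (the share of tokens equal to the query word) is purely an assumption, since the draft leaves scoring implementation defined.

```python
import re

def words_of(text):
    """Lower-cased word tokens of a text (an assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", text.lower())

def substring_search(items, needle):
    """Items containing `needle` anywhere, even inside a longer word."""
    return [t for t in items if needle.lower() in t.lower()]

def word_search(items, word):
    """Items containing `word` as a whole token."""
    return [t for t in items if word.lower() in words_of(t)]

def ranked(items, word):
    """Matching items ordered by a naive score: share of tokens equal to `word`."""
    scored = []
    for t in items:
        toks = words_of(t)
        hits = toks.count(word.lower())
        if hits:
            scored.append((hits / len(toks), t))
    return [t for score, t in sorted(scored, key=lambda p: p[0], reverse=True)]

news = [
    "Foobar Corporation releases the 20.9 version",  # from the draft's example
    "Tenants may renew the lease",                   # invented sample item
    "The lease covers one lease term only",          # invented sample item
]

by_substring = substring_search(news, "lease")  # also matches "releases"
by_word = word_search(news, "lease")            # whole-word matches only
by_score = ranked(news, "lease")                # highest-scoring item first
```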

Introduction to Full-Text Use Cases

The use cases documented in XQuery 1.0 and XPath 2.0 Full-Text Use Cases "were created by XML Query and XSL Working Groups to illustrate important applications of full-text querying within an XML query language. Each use case exercises a specific functionality relevant to full-text querying. An XML Schema and sample input data are provided. Each use case specifies a query applied to the input data, a solution in XQuery, a solution in XPath (when possible), and the expected results...

These use cases:

  1. Present some possible functions and features for tokenized text support in XQuery and XPath. None are yet available in XQuery or XPath...

  2. Illustrate simple and complex queries. The more complex queries would normally only be constructed by programmers, librarians, and other expert users, or provided for novice users via saved queries and graphical user interfaces. Each query is intended to illustrate a single functionality, although queries might overlap in their functionalities (e.g., phrases and ordered distance queries allowing no intervening words). Overlapping and similar functionalities are noted in the comments on query behavior.

  3. Draw from sample data which are almost entirely in English. Use cases in other languages are solicited, especially where they illustrate language-specific implementations of functions and features. Among the most sought after are use cases for queries using prefix and infix wild cards, proximity queries, and operators and queries requiring functionality which may not have Western language equivalents.

  4. Include queries which in most instances can be written with pure Boolean full-text predicates or with scoring (e.g., scoring on the number of occurrences of a word or phrase, scoring on how close words are to one another within a distance query, scoring on how similar a word is to the one being stemmed). A few, those in Section 17, cannot be written with Boolean full-text predicates. Scoring methodologies will not be defined in this recommendation. Scoring will be implementation defined. Results are provided in document order, except those in Section 17. Results could be returned ordered differently, such as by relevance (based on implementation defined scoring) or explicitly by element...

  5. Query element content. One may see Section 4 for explicit queries on attribute values.

  6. Include queries which are case-insensitive. When returning a paragraph, the text is returned as it occurs in the data model. This approach was chosen to keep the sample data short and the expected results meaningful. It would have been equally valid to return only the character queried...

  7. Include queries which when they target XML elements are understood, unless otherwise stated, to query text within any text node descendant of the element.

  8. Include queries which return only elements and attributes which meet all the conditions specified in the query. In particular, Boolean queries return results where the Boolean conditions in the query are satisfied, i.e., are used to select what is being returned to users. Query results may be returned in different ways. From a query for books containing the word "usability", users might be interested in returning, for each book containing the word "usability", its number and its entire content. In another situation for the same query, users might be interested in returning, for each book containing the word "usability", its number and only the elements and attributes in the content which contain the word "usability". As in this second situation, the queries in these use cases return only elements and attributes which meet all the conditions specified in the query. The Return clause may also include additional or different elements and attributes if specified, and may construct new elements.

  9. Include queries which provide some of the basic functionality of fuzzy match querying (e.g., wildcards, stemming, thesaurus support, proximity).

  10. Provide highlighting of found words and phrases in the expected results of queries as an aid to users. The presence of highlighting says nothing about whether highlighting will be a feature of XQuery or XPath full-text querying.

  11. Display query solutions in XQuery and when possible in XPath. Queries which may not be written in XPath include those which contain element constructors, and cannot be written without let and order by clauses.

About the W3C XML Query Project

"The mission of the XML Query Project is to provide flexible query facilities to extract data from real and virtual documents on the World Wide Web, therefore finally providing the needed interaction between the Web world and the database world. Ultimately, collections of XML files will be accessed like databases. The ambitious task of the XML Query (XQuery) Working Group is therefore to develop the first world standard for querying web documents, following the incredibly successful discussion started at the QL'98 event. However, the XML Query (XQuery) project is all-around, and also includes in its efforts not only the standard for querying XML documents, but also the next-generation standards for doing XML selection (XPath2), for doing XML serialization, for doing Full-Text Search, for providing a possible functional XML Data Model, and for providing a standard set of functions and operators for manipulating web data..." [from the XQuery page]

Document URI: http://xml.coverpages.org/ni2004-07-12-a.html
Robin Cover, Editor: robin@oasis-open.org