com.futurexpert.xpert.xretrieve
Class Xretrieve

java.lang.Object
  |
  +--com.futurexpert.xpert.xindex.Xretrieve


public class Xretrieve
extends java.lang.Object
The Xretrieve class is the primary retrieval class. You retrieve XML elements by creating an Xretrieve object and initializing it for a collection. A simple example is:
     	Xretrieve xre = new Xretrieve();
     	xre.initialize(coll);
 

After creating the retrieval object, you have to load a portion of the indexes by calling the method initialize(). Initialization is needed only once for each XML collection named coll. For example, if you want to issue 10 queries against an XML collection (or file), you initialize once before the first query and do not call initialize() again for the following queries. Initialization takes from milliseconds to several seconds, depending on the size of the collection.

	
Just as the class Xindex lets users choose their own stemming class and stop word files, Xretrieve offers similar flexibility. Unlike Xindex, however, Xretrieve lets users set only the stemming class; you do not need to set the stop words for retrieval. Note that you have to use the same stemming class in both Xindex and Xretrieve. Otherwise, you will get unexpected answers. For instance, suppose that you use the Porter stemming class in indexing and the Stemmer class (which does not perform any stemming) in retrieval. If you issue the query "//SECTION[contains(PARA, "query")]" in retrieval, you will never get the relevant SECTIONs, even though some PARA elements contain the word "query". This is because the Porter stemmer transforms "query" into "queri" and the word is indexed as "queri". So unless you also use the Porter stemmer in retrieval, the query word "query" never matches the stem "queri".
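The mismatch can be reproduced without XPERT at all. The following is a minimal sketch, assuming a toy index of stemmed words; the names StemmerMismatchDemo, TOY_PORTER, and NO_STEMMING are hypothetical stand-ins and are not the XPERT Stemmer or Porter classes. The toy stemmer only rewrites a final "y" to "i", which is enough to turn "query" into "queri".

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy illustration only: these lambdas stand in for stemming classes and are
// NOT the XPERT Stemmer or Porter implementations.
class StemmerMismatchDemo {
    // Hypothetical stand-in for a Porter-like stemmer: only rewrites a final "y" to "i".
    static final UnaryOperator<String> TOY_PORTER =
        w -> w.endsWith("y") ? w.substring(0, w.length() - 1) + "i" : w;

    // Hypothetical stand-in for a stemmer that performs no stemming.
    static final UnaryOperator<String> NO_STEMMING = w -> w;

    // Build an index of stemmed terms from whitespace-separated text.
    static Set<String> index(String text, UnaryOperator<String> stemmer) {
        Set<String> terms = new HashSet<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(stemmer.apply(word));
            }
        }
        return terms;
    }

    // Look up a query word after applying the retrieval-side stemmer.
    static boolean matches(Set<String> index, String queryWord, UnaryOperator<String> stemmer) {
        return index.contains(stemmer.apply(queryWord.toLowerCase()));
    }

    public static void main(String[] args) {
        Set<String> idx = index("a sample query", TOY_PORTER);
        System.out.println(matches(idx, "query", TOY_PORTER));  // same stemmer on both sides
        System.out.println(matches(idx, "query", NO_STEMMING)); // mismatched stemmers
    }
}
```

With the same stemmer on both sides the lookup succeeds; with the non-stemming stand-in on the retrieval side the word "query" is looked up unchanged and misses the indexed stem "queri".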
You can evaluate a query by calling the method evaluate() and get the element content by calling getElement() or getElementHead(). Here is a complete example that initializes a collection, evaluates a query, and gets the element contents.
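	// coll and XPQLquery are Strings: the collection name and the XPQL query.
	Xretrieve xre = new Xretrieve();
	xre.initialize(coll);	// load index data once per collection
	Vector result = xre.evaluate(XPQLquery);
	for (int i = 0; i < result.size(); i++) {
		ElementInfo ei = (ElementInfo) result.elementAt(i);
		String content = xre.getElement(ei);	// full string content of the element
		int score = ei.score;	// similarity between the query and the element
		System.out.println(score);
		System.out.println(content);
	}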
 
The evaluate() method takes a query string, evaluates the XPQL query against the collection named coll, and returns a vector of ElementInfo objects. An ElementInfo object holds information about the retrieved element, such as the path information, the start and end positions in the collection file, and the score. With an ElementInfo object, you can get the content string of the element and the score that indicates how similar the element is to the given query.
The score in an ElementInfo object is an integer value that represents the similarity between the query and the element. Note that the score is always 1 unless the query uses the "contains()" or "in()" function. If you use "contains()" or "in()" inside a predicate, the score represents the similarity based on the number of occurrences of the words. If the function contains just one word, as in "//SECTION[contains(PARA, "XML")]" or "//SECTION[ in("XML", PARA)]", the score is exactly the number of occurrences of the word in the condition node. For instance, if you get a score of 3 for the previous query, it means that the PARA elements of that //SECTION contain three occurrences of "XML". Note that occurrences of "XML" that appear in the "//SECTION" outside PARA are not counted.
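The single-word case can be sketched as a plain occurrence count over the text of the condition nodes. This is only an illustration: the class ScoreSketch, the tokenization on non-word characters, and the case-insensitive comparison are assumptions, not a description of how XPERT computes the score internally.

```java
import java.util.List;

// Toy illustration only: the tokenization and case handling here are assumptions,
// not a description of how XPERT computes scores internally.
class ScoreSketch {
    // Count occurrences of a single word across the text of the condition nodes.
    static int score(List<String> conditionNodeTexts, String word) {
        int count = 0;
        for (String text : conditionNodeTexts) {
            for (String token : text.split("\\W+")) {
                if (token.equalsIgnoreCase(word)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Text of the PARA children of one SECTION. Text that sits in the SECTION
        // but outside PARA is simply not passed in, so it cannot affect the count.
        List<String> paras = List.of("XML is a markup language.", "XML and more XML.");
        System.out.println(score(paras, "XML")); // prints 3
    }
}
```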
 
Another thing to remember is that XPERT retrieves only the outermost element when several matching elements are nested inside one another. For instance, suppose that a <PARA> element matches the query "//PARA[contains(., "XPERT")]" and contains another matching <PARA> element, as in:
<PARA>
	XPERT
	<PARA>
		XPERT
	</PARA>
</PARA>

then XPERT returns only the outer <PARA> instead of returning both. XPERT considers efficiency more important than completeness. If you want to find the nested elements, you can search within the retrieved result, which lets you recover completeness.
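The outermost-only behavior can be pictured in terms of start and end positions. The sketch below is an assumption-laden illustration, not XPERT code: Span is a hypothetical stand-in for the start and end positions carried by an ElementInfo, and the filter assumes the spans come from well-formed XML, so they nest but never partially overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: Span is a hypothetical stand-in for the start and end
// positions that an ElementInfo carries; it is not the XPERT class.
class OutermostSketch {
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Keep only the spans that are not nested inside another matching span.
    static List<Span> outermost(List<Span> matches) {
        List<Span> sorted = new ArrayList<>(matches);
        // Sort by start ascending; for equal starts, put the longer span first.
        sorted.sort((a, b) -> a.start != b.start
                ? Integer.compare(a.start, b.start)
                : Integer.compare(b.end, a.end));
        List<Span> kept = new ArrayList<>();
        int coveredUntil = -1; // end position of the last kept span
        for (Span s : sorted) {
            if (s.end > coveredUntil) {
                kept.add(s);
                coveredUntil = s.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> matches = new ArrayList<>();
        matches.add(new Span(0, 50));  // outer <PARA>
        matches.add(new Span(10, 30)); // nested <PARA>
        matches.add(new Span(60, 80)); // unrelated <PARA>
        System.out.println(outermost(matches).size()); // prints 2
    }
}
```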



XPQL is based on XPath and supports most of the abbreviated form of XPath, with some limitations. Even though XPath is a powerful query language for style sheets, a complicated query is difficult to understand and even more difficult to implement efficiently. Instead of providing that much expressiveness, we impose some limitations in exchange for clarity and an efficient implementation. In fact, XPQL is not a subset of XPath: it has an "in" operator that provides a powerful retrieval function not defined in the original XPath definition. The "in" operator supports "*", phrase retrieval, and the "and" and "or" operators, whereas the "contains" operator in XPath supports only exact string matching.
XPQL is still evolving and can change as needs arise, so your feedback is very important in making XPQL better. In addition, we will support a couple of features that XPath does not support, such as 'range', in the near future.
 
The current limitations of XPQL compared with the XPath abbreviated forms are:
(1) a test node can have at most one predicate.
(2) a predicate cannot appear inside another predicate.
(3) a test node should be an ancestor of the condition nodes appearing in its predicate.
(4) at present, some operators cannot appear in some positions.
(5) numbers appearing inside a predicate should be positive integers.
(6) one side of an equality or relational operator (=, <, >, <=, >=) should be a literal.
(7) the join operation is not supported yet.
(8) test nodes should be elements.
(9) at present, only three functions, "contains()", "in()", and "last()", are supported.
(10) '*' representing 'anything' is not supported.
 
With (1), a query like "//researchers/person[@name = "Shin"][@loc = "Bethesda"]" is invalid, even though it is valid in XPath. You have to convert the query to "//researchers/person[@name = "Shin" and @loc = "Bethesda"]".
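The conversion for (1) is mechanical when every predicate is a plain boolean condition. The helper below is a hypothetical sketch and not part of XPERT: it joins adjacent predicates with "and" while leaving brackets inside double-quoted literals alone. It assumes that no predicate is positional (such as "[1]") and that no predicate contains "or", since merging those this way would change the meaning of the query.

```java
// Toy illustration only: a hypothetical helper, not part of XPERT. It assumes the
// predicates are boolean conditions without "or"; positional predicates such as
// [1][2] would change meaning if merged this way.
class PredicateMerge {
    // Rewrite "...[p1][p2]" as "...[p1 and p2]", ignoring brackets inside
    // double-quoted literals.
    static String merge(String query) {
        StringBuilder out = new StringBuilder();
        boolean inLiteral = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                inLiteral = !inLiteral;
            }
            boolean adjacent = !inLiteral && c == ']'
                    && i + 1 < query.length() && query.charAt(i + 1) == '[';
            if (adjacent) {
                out.append(" and ");
                i++; // skip the '[' that opened the second predicate
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String q = "//researchers/person[@name = \"Shin\"][@loc = \"Bethesda\"]";
        System.out.println(merge(q));
    }
}
```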
 
With (2), a query like "//researchers/person[@name = //salesperson/person/name[@id > "8080"]]" is invalid, since it has a predicate nested inside another predicate.
 
With (3), a query like "//SECTION/TITLE[contains(../PARA, "XML")]" is invalid since the test node "//SECTION/TITLE" is not an ancestor of the condition node "//SECTION/PARA". You have to convert the query to "//SECTION[contains(PARA, "XML")]/TITLE", where the test node "//SECTION" is an ancestor of the condition node "//SECTION/PARA".
 
With (4), the operator "|" (the union operator) is the only operator allowed for connecting path expressions outside predicates. So the query "//a | //b | //c" is valid, whereas the query "//a or //b or //c" is invalid. Inside predicates, the "and" and "or" operators are allowed as well as "|". So the query "//a[1 and in("infor*", b) or last( )]" is valid.
 
With (5), a query like "//a[-1]" or "//a[1.1]" is invalid. A number inside a predicate should be a positive integer. Hence, a query like "//a[5]" is valid. Note that a number and a literal are different from each other, as in XPath. A number is written without enclosing double quotes ("). A literal, on the other hand, is a string enclosed in double quotes. Hence, 1 is a number, whereas "1" is a literal.
 
With (6) and (7), comparison inside predicates is limited at present. One argument of an equality or comparison operator should be a literal. For instance, a query like "//a[@id > "100"]" or "//a[title = "XPERT"]" is valid, whereas one like "//a[@id=//b/@id]" is invalid. Semi-joins are not allowed at present, but they will be supported later.
 
With (8), only element nodes can be test nodes at present. Hence a query like "/person/@first-name[@last-name = "shin"]" is invalid. You have to write a query like "/person[@last-name = "shin"]" and search the element content. Attribute test nodes will be supported in a later version.
With (9), only three functions, "contains()", "in()", and "last()", are supported at present. More functions will be supported later.
You can look at the current definition of XPQL to see which queries are supported at present.
 
 


 

Constructor Summary

Xretrieve( )

 Initialize the retrieval object

 

Method Summary

void

initialize (String collection)
          It reads a portion of index information for the collection into memory.

 void

setStemClass (Stemmer stemmer)
          Set a customized stemmer class.

 Vector

evaluate(String XPQLquery)

          Evaluates the XPQLquery and returns a vector of ElementInfo objects.

 String

getElement (ElementInfo ei)
          It returns the string content of the element corresponding to ei

 String

getElementHead (ElementInfo ei, int length)
          It returns the first 'length' bytes of the string content of the element corresponding to ei

 String

getFileName (ElementInfo ei)
          It returns the full path name of the file that contains the element.

   

Constructor Detail

Xretrieve( )

public Xretrieve()
Initializes a new retrieval object.


 

Method Detail

initialize

public void initialize(String coll)
Initializes the collection named coll by reading a portion of its index into memory. You have to index the collection before calling this method.


setStemClass

public void setStemClass(Stemmer stemmer)
Sets a customized stemmer class, which should be given as the fully qualified class name, such as "com.futurexpert.xpert.xindex.Porter". Note that you have to use the same class as the one you used in indexing. The class should be a subclass of Stemmer. If it is set to null, Stemmer, the root class of the stemming classes, is used; Stemmer itself does not perform any stemming. If you do not call the method at all, the default Porter stemmer is used.
Parameters:
stemmer - a subclass of Stemmer
 


evaluate

public Vector evaluate(String XPQLquery)
Given an XPQL query of type String, the method evaluates the query and returns a vector of ElementInfo objects. With the ElementInfo objects, you can get the score and the content of each element.
Parameters:
XPQLquery - a String type XPQL query.
 


getElement

public String getElement(ElementInfo ei)
The method gets the string content of the element ei by referring to the collection file. The ElementInfo class is defined as:
public class ElementInfo {
…
public int score;
…
}
 
Using this information, the method accesses the start location of the element in the collection file (whose extension is "coll") and gets the string content of the element. You can get the similarity value from the score field.
 
 


getElementHead

public String getElementHead(ElementInfo ei, int length)
The method gets the first "length" bytes of the string content of the element ei. If you want to build an XML information retrieval system using XPERT, you can first show the head of each retrieved element by using this method. If a user then wants to read the whole content, you can call getElement() later.
 


getFileName

public String getFileName(ElementInfo ei)
The method gets the full path name of the file that contains the element ei. If you want to access the XML file that contains the element directly, without using getElement(), you can use this method.