com.futurexpert.xpert.xretrieve
Class Xretrieve
java.lang.Object
|
+--com.futurexpert.xpert.xindex.Xretrieve
- public class Xretrieve
- extends java.lang.Object
The Xretrieve class represents the primary retrieval class. You can retrieve XML elements by creating an Xretrieve object. A simple example is:
Xretrieve xre = new Xretrieve();
xre.initialize(coll);
After creating the retrieval object, you have to load a portion of the indexes by calling the method initialize(). Initialization is needed only once for each XML collection named coll. So, if you want to issue 10 queries against an XML collection (or file), you initialize once before the first query and do not need to call initialize() again for the following queries. The initialization step takes from milliseconds to several seconds, depending on the size of the collection.
Just as the class Xindex lets users choose their own stemming class and stop word file, Xretrieve offers similar flexibility. Unlike Xindex, however, Xretrieve lets users set only the stemming class; in retrieval, you do not need to set the stop words.
Note that you have to use the same stemming class in both Xindex and Xretrieve. Otherwise, you will get unexpected answers.
For instance, suppose that you use the Porter stemming class in indexing and the Stemmer class (which in fact does not do any stemming) in retrieval.
If you issue the query "//SECTION[contains(PARA, "query")]" in retrieval, you will never get the relevant SECTIONs, even though some PARA elements contain the word "query". This is because the Porter stemmer transforms "query" into "queri" and indexes it as "queri". So unless you use the Porter stemmer in retrieval, your query word "query" never matches the stem "queri".
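The mismatch described above can be illustrated with a small self-contained sketch. It does not use any XPERT classes: identityStem and toyPorterStem are hypothetical stand-ins, and toyPorterStem applies only one simplified suffix rule (a final "y" becomes "i"), which is enough to reproduce the "query" / "queri" example.

```java
import java.util.HashSet;
import java.util.Set;

// Toy illustration only: these stemmers are hypothetical stand-ins, not XPERT classes.
class StemMismatchDemo {
    // Stand-in for a stemmer that does no stemming: returns the word unchanged.
    static String identityStem(String word) {
        return word;
    }

    // Stand-in for a single simplified Porter-style rule: a final 'y' becomes 'i'.
    static String toyPorterStem(String word) {
        if (word.length() > 2 && word.endsWith("y")) {
            return word.substring(0, word.length() - 1) + "i";
        }
        return word;
    }

    public static void main(String[] args) {
        // The "index" is built with the toy Porter rule, so it stores "queri".
        Set<String> index = new HashSet<>();
        index.add(toyPorterStem("query"));

        // Looking up with a different stemmer misses the indexed term.
        System.out.println(index.contains(identityStem("query")));  // false
        // Looking up with the same stemmer finds it.
        System.out.println(index.contains(toyPorterStem("query"))); // true
    }
}
```

The lookup succeeds only when the query word goes through the same transformation as the indexed word, which is why the same stemming class is needed on both sides.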
You can evaluate a query by calling the method evaluate() and get the element content by calling getElement() or getElementHead(). Here is a complete example that initializes a collection, evaluates a query, and gets the element contents.
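// coll is the collection name; XPQLquery is the query string.
Xretrieve xre = new Xretrieve();
xre.initialize(coll);                      // load index data once per collection
Vector result = xre.evaluate(XPQLquery);   // vector of ElementInfo objects
for (int i = 0; i < result.size(); i++) {
    ElementInfo ei = (ElementInfo) result.elementAt(i);
    String content = xre.getElement(ei);   // full string content of the element
    int score = ei.score;                  // similarity between query and element
    System.out.println(score);
    System.out.println(content);
}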
The evaluate() method takes a query string, evaluates the XPQL query against the collection named coll, and returns a vector of ElementInfo objects. An ElementInfo object holds information about the retrieved element, such as the path information, the start and end positions in the collection file, and the score. With an ElementInfo object, you can get the content string of the element and the score, which indicates how similar the element is to the given query.
The score in an ElementInfo object is an integer value that represents the similarity between the query and the element. Note that the score is always 1 unless the query uses the "contains()" or "in()" function. If you use "contains()" or "in()" inside a predicate, the score represents a similarity based on the number of occurrences of the words. If the function contains just one word, such as "//SECTION[contains(PARA, "XML")]" or "//SECTION[in("XML", PARA)]", the score is exactly the number of occurrences of the word in the condition node. For instance, if you get a score of 3 with the previous query, it means that the PARA elements of //SECTION contain three occurrences of "XML". Note that any further occurrences of "XML" in the SECTION outside PARA do not count.
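The single-word case can be sketched as plain occurrence counting restricted to the condition node. The code below is only an illustration using the standard Java DOM parser; the class name, the helper countInTag, and the word-splitting rule are assumptions, and XPERT's actual tokenization and scoring code is not shown in this document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch: counts word occurrences inside a given tag only.
class OccurrenceScoreDemo {
    // Counts whole-word occurrences of `word` in the text of every <tag> element.
    // Assumes <tag> elements are not nested inside one another.
    static int countInTag(String xml, String tag, String word) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName(tag);
        int count = 0;
        for (int i = 0; i < nodes.getLength(); i++) {
            // Split on runs of non-word characters (an assumed tokenization).
            for (String token : nodes.item(i).getTextContent().split("\\W+")) {
                if (token.equals(word)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<SECTION><TITLE>XML basics</TITLE>"
                + "<PARA>XML and more XML</PARA><PARA>XML again</PARA></SECTION>";
        // Only the PARA text is counted; the "XML" in TITLE is ignored.
        System.out.println(countInTag(xml, "PARA", "XML")); // 3
    }
}
```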
Another thing to remember is that XPERT retrieves only the outermost element when several matching elements are nested inside one another.
For instance, suppose a <PARA> element matches the query "//PARA[contains(., "XPERT")]" and contains another matching <PARA> element, as in:
<PARA>
XPERT
<PARA>
XPERT
</PARA>
</PARA>
then XPERT returns only the outer <PARA> instead of returning both. XPERT considers efficiency more important than completeness.
If you want to find the nested elements, you can search within the retrieved result instead, which lets you recover completeness.
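One possible way to search the retrieved result is to parse the returned content string and look for nested matches yourself. The sketch below assumes the content string is a well-formed XML fragment that includes the element's own tags; the class and method names are hypothetical, and only the standard Java DOM API is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch: finds matching elements nested inside a retrieved element.
class NestedElementDemo {
    // Returns the text of every <tag> descendant of the root whose text contains `word`.
    // The root element itself (the element XPERT already returned) is not included.
    static List<String> nestedMatches(String content, String tag, String word) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
        NodeList nested = doc.getDocumentElement().getElementsByTagName(tag);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < nested.getLength(); i++) {
            String text = nested.item(i).getTextContent();
            if (text.contains(word)) {
                hits.add(text.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Stands in for the string that would be returned for the outer <PARA>.
        String content = "<PARA>XPERT <PARA>XPERT</PARA></PARA>";
        System.out.println(nestedMatches(content, "PARA", "XPERT")); // [XPERT]
    }
}
```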
XPQL is based on XPath and supports most of the abbreviated form of XPath, with some limitations.
Even though XPath is a powerful query language, a complicated query is difficult to understand and, moreover, difficult to implement efficiently.
Instead of providing such expressiveness, we impose some limitations and gain clarity and an efficient implementation. XPQL is not a subset of XPath, however: it has an "in" operator that provides a powerful retrieval function, helpful in information retrieval, which is not defined in the original XPath definition. The "in" operator supports "*", phrase retrieval, and the "and" and "or" operators, whereas the "contains" operator in XPath supports only exact string matching.
Of course, XPQL is evolving and can change as needs arise, so your feedback is very important in making XPQL better! In addition, we will support a couple of features that are not supported in XPath, such as 'range', in the near future.
The current limitations of XPQL compared with the XPath abbreviated form are:
(1) a test node can have at most one predicate.
(2) a predicate cannot appear inside another predicate.
(3) a test node should be an ancestor of the condition nodes appearing in its predicate.
(4) at present, some operators cannot appear at some positions.
(5) numbers appearing inside a predicate should be positive integers.
(6) one side of an equality or relational operator (=, <, >, <=, >=) should be a literal.
(7) the join operation is not supported yet.
(8) test nodes should be elements.
(9) at present, only three functions, "contains()", "in()", and "last()", are supported.
(10) '*' representing 'anything' is not supported.
With (1), a query like "//researchers/person[@name = "Shin"][@loc = "Bethesda"]" is invalid, even though it is valid in XPath. You have to convert the query to "//researchers/person[@name = "Shin" and @loc = "Bethesda"]".
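For simple comparison predicates like the ones in this example, the conversion can be sketched as a string rewrite. The helper below is hypothetical and deliberately naive: it assumes the character sequence "][" never occurs inside a quoted literal and that joining the predicates with "and" is the intended meaning, as it is for the attribute comparisons above.

```java
// Hypothetical helper, not part of XPERT: merges chained predicates into one.
class PredicateMergeDemo {
    // Rewrites "[p1][p2]" as "[p1 and p2]".
    // Naive: assumes "][" never appears inside a quoted literal.
    static String mergePredicates(String query) {
        return query.replace("][", " and ");
    }

    public static void main(String[] args) {
        String xpath = "//researchers/person[@name = \"Shin\"][@loc = \"Bethesda\"]";
        // Prints: //researchers/person[@name = "Shin" and @loc = "Bethesda"]
        System.out.println(mergePredicates(xpath));
    }
}
```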
With (2), a query like "//researchers/person[@name = //salesperson/person/name[@id > "8080"]]" is invalid, since it has a predicate nested inside another predicate.
With (3), a query like "//SECTION/TITLE[contains(../PARA, "XML")]" is invalid since the test node "//SECTION/TITLE" is not an ancestor of the condition node "//SECTION/PARA". You have to convert the query to "//SECTION[contains(PARA, "XML")]/TITLE", where the test node "//SECTION" is an ancestor of the condition node "//SECTION/PARA".
With (4), the "|" (union) operator is only allowed for connecting path expressions outside predicates. So the query "//a | //b | //c" is valid, whereas the query "//a or //b or //c" is invalid. Inside predicates, the "and" and "or" operators are allowed as well as "|". So the query "//a[1 and in("infor*", b) or last( )]" is valid.
With (5), a query like "//a[-1]" or "//a[1.1]" is invalid. The number inside a predicate should be a positive integer. Hence, a query like "//a[5]" is valid. Note that a number and a literal are different from each other, as in XPath. A number is not enclosed in double quotes ("), whereas a literal is a string enclosed in double quotes. Hence, 1 is a number, whereas "1" is a literal.
With (6) and (7), comparison inside predicates is limited at present: one argument of an equality or comparison operator should be a literal. For instance, a query like "//a[@id > "100"]" or "//a[title = "XPERT"]" is valid, whereas one like "//a[@id=//b/@id]" is invalid. Semi-joins are not allowed at present, but they will be supported later.
With (8), only element nodes can be test nodes at present. Hence a query like "/person/@first-name[@last-name = "shin"]" is invalid. You have to write a query like "/person[@last-name = "shin"]" and search the element content. Non-element test nodes will be supported in a later version.
With (9), only three functions, "contains()", "in()", and "last()", are supported at present. More functions will be supported later.
You can look at the current definition of XPQL to see which queries are supported at present.
Constructor Summary
Xretrieve() | Initializes the retrieval object.
Method Summary
void | initialize(String collection) | Reads a portion of the index information for the collection into memory.
void | setStemClass(Stemmer stem) | Sets a customized stemmer class.
Vector | evaluate(String XPQLquery) | Evaluates the XPQLquery and returns a vector of ElementInfo objects.
String | getElement(ElementInfo ei) | Returns the string content of the element corresponding to ei.
String | getElementHead(ElementInfo ei, int length) | Returns the first 'length' bytes of the string content of the element corresponding to ei.
String | getFileName(ElementInfo ei) | Returns the full path name of the file that contains the element.
Xretrieve( )
public Xretrieve()
- Initializes a new retrieval object.
initialize
public void initialize(String coll)
- Initializes the collection named coll by reading a portion of its index into memory. You have to index the collection before calling this method.
setStemClass
public void setStemClass(Stemmer stemmer)
- Sets a customized stemmer class, which should be given as the full path name such as "com.futurexpert.xpert.xindex.Porter". Note that you have to use the same class as the one you use in indexing. The class should be a subclass of Stemmer. If it is set to null, Stemmer, the root class of the stemming classes, is used; Stemmer itself does not perform any stemming. If you do not call the method at all, the default Porter stemmer is installed.
- Parameters:
stemmer - a subclass of Stemmer
evaluate
public Vector evaluate(String XPQLquery)
- Given an XPQL query as a String, the method evaluates the query and returns a vector of ElementInfo objects. With the ElementInfo objects, you can get the score and the content of each element.
- Parameters:
XPQLquery - a String type XPQL query.
getElement
public String getElement(ElementInfo ei)
- The method gets the string content of the element that ei refers to in the collection file. The ElementInfo class is defined as:
public class ElementInfo {
…
public int score;
…
}
Using this information, the method accesses the start location of the element in the collection file (whose extension is "coll") and gets the string content of the element. You can get the similarity value from the score field.
getElementHead
public String getElementHead(ElementInfo ei, int length)
- The method gets the first "length" bytes of the string content of the element ei. If you want to build an XML information retrieval system using XPERT, you can first show the head of each retrieved element by using this method. If a user then wants to read the whole content, you can call getElement().
getFileName
public String getFileName(ElementInfo ei)
- The method gets the full path name of the file that contains the element ei. If you want to access the file directly, without using getElement(), you can use this method and open the XML file that contains the element.
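Direct access could look like the following sketch, which reads a whole file from a path string using only the standard java.nio.file API. The class name and readWholeFile are hypothetical, the fileName argument plays the role of the string returned by getFileName(ei), and UTF-8 encoding is an assumption; a temporary file stands in for an indexed XML document so that the example runs on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: reads the XML file named by a path string.
class DirectFileAccessDemo {
    // Reads the whole file as text; UTF-8 is assumed here.
    static String readWholeFile(String fileName) throws IOException {
        return new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the file that contains the element.
        Path tmp = Files.createTempFile("xpert-demo", ".xml");
        try {
            Files.write(tmp, "<PARA>XPERT</PARA>".getBytes(StandardCharsets.UTF_8));
            System.out.println(readWholeFile(tmp.toString())); // <PARA>XPERT</PARA>
        } finally {
            Files.delete(tmp);
        }
    }
}
```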