com.futurexpert.xpert.xindex
Class Stemmer
java.lang.Object
|
+--com.futurexpert.xpert.xindex.Stemmer
- public class Stemmer
- extends Object
- The Stemmer class is the root class of all customized stemming classes.
The class has three methods: getStem(), getWordAndResolveCase(), and getWord(). The first two are public, whereas the last one is protected.
The method getStem() gets the stem of a token. It calls getWordAndResolveCase(), which strips the special characters from the input token by calling getWord() and then converts the result to lowercase. To define your own stemmer class, extend Stemmer and override the two public methods with your own logic.
The Stemmer class is defined as follows.
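public class Stemmer {

    // Returns the stem of a token. The base class does no real stemming.
    public String getStem( String str ) {
        return getWordAndResolveCase(str);
    }

    // Strips the special characters, then converts the result to lowercase.
    public String getWordAndResolveCase(String str) {
        return getWord(str).toLowerCase();
    }

    // Keeps only the letters and digits of the token.
    // Note: str.charAt(0) throws StringIndexOutOfBoundsException for an empty token.
    protected String getWord( String str ) {
        int last = str.length();
        Character ch = new Character( str.charAt(0) );
        String temp = "";
        for ( int i=0; i < last; i++ ) {
            // isLetterOrDigit is static; ch only serves as a handle to call it.
            if ( ch.isLetterOrDigit( str.charAt(i) ) )
                temp += str.charAt(i);
        }
        return temp;
    }
} //class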
-
- The Stemmer does not perform stemming; it only strips the special characters from the input token and resolves its case.
A token is a sequence of characters without any spaces. Note that Stemmer is not an abstract class, because it is the stemmer used when you pass null to the setStemClass() method of Xindex or Xretrieve.
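To make the cleaning step concrete, the following is a minimal sketch of the same filtering and case folding written as a standalone helper. TokenCleaner and its method names are hypothetical and are not part of XPERT. Unlike the listed getWord(), which reads str.charAt(0) before the loop, this variant also accepts an empty token.

```java
// Hypothetical helper, not part of XPERT: an empty-safe equivalent of the
// filtering and case folding that the base Stemmer applies to a token.
final class TokenCleaner {
    private TokenCleaner() {}

    // Keeps only letters and digits, mirroring Stemmer.getWord().
    static String stripSpecials(String token) {
        StringBuilder kept = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    // Mirrors Stemmer.getWordAndResolveCase(): strip first, then lowercase.
    static String clean(String token) {
        return stripSpecials(token).toLowerCase();
    }
}
```

With this sketch, TokenCleaner.clean("(XML)") yields "xml", TokenCleaner.clean("Retrieval,") yields "retrieval", and TokenCleaner.clean("") yields an empty string.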
You may wonder why your stemmer class needs to define getWordAndResolveCase() as well as getStem(). The reason is that XPERT supports the wildcard character ("*").
XPERT calls getStem() if a user searches for a word without the wildcard character, as in "//SECTION[in("XML", PARA)]". If a user uses the wildcard character inside the in() function, as in "//SECTION[in("infor*", PARA)]", XPERT calls getWordAndResolveCase() for the prefix "infor" instead of getStem(). Invoking getStem() for "infor" may return an unwanted stem; it makes no sense to stem a prefix of a word.
In conclusion, during indexing, XPERT always calls getStem() for a token, because a "*" occurring in text is just the literal character "*" and does not mean "any sequence of characters". At retrieval, XPERT calls getStem() for a word without "*", whereas it calls getWordAndResolveCase() for a word with "*", which in turn calls getWord().
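The retrieval-time choice between the two public methods can be sketched as follows. This is only an illustration under stated assumptions: the Stemmer class here is a local stand-in for com.futurexpert.xpert.xindex.Stemmer so that the example compiles on its own, PluralStemmer is a deliberately naive hypothetical stemmer, and QueryTermRouter is a hypothetical name; XPERT's actual query code is not shown in this document.

```java
// Stand-in for com.futurexpert.xpert.xindex.Stemmer so the sketch compiles
// on its own; a real project would import the XPERT class instead.
class Stemmer {
    public String getStem(String str) { return getWordAndResolveCase(str); }
    public String getWordAndResolveCase(String str) { return getWord(str).toLowerCase(); }
    protected String getWord(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            if (Character.isLetterOrDigit(str.charAt(i))) sb.append(str.charAt(i));
        }
        return sb.toString();
    }
}

// Hypothetical custom stemmer: a naive rule that drops one trailing "s".
// A real stemming algorithm would replace this rule.
class PluralStemmer extends Stemmer {
    @Override
    public String getStem(String token) {
        String word = getWordAndResolveCase(token);
        boolean plural = word.length() > 3 && word.endsWith("s") && !word.endsWith("ss");
        return plural ? word.substring(0, word.length() - 1) : word;
    }
}

// Hypothetical illustration of the routing described above.
final class QueryTermRouter {
    private QueryTermRouter() {}

    static String normalize(Stemmer stemmer, String term) {
        int star = term.indexOf('*');
        if (star >= 0) {
            // Wildcard term: clean the prefix but never stem it.
            return stemmer.getWordAndResolveCase(term.substring(0, star));
        }
        // Plain word: apply the full stemmer.
        return stemmer.getStem(term);
    }
}
```

With this sketch, normalize(new PluralStemmer(), "sections") yields "section", while normalize(new PluralStemmer(), "news*") yields "news". Calling getStem("news") on the same stemmer would yield "new", which shows why a prefix should not be passed through getStem().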
At present, XPERT does not support UTF-8 or other Unicode encodings, so you may not be able to plug in a stemmer class for a language other than English. Support for UTF-8 and other encodings will be added in a later version.
Method Summary
String | getStem(String token) | Gets the stem of the input String. It calls getWordAndResolveCase(), which in turn calls getWord().
String | getWordAndResolveCase(String token) | Strips off the special characters and resolves the case.
protected String | getWord(String token) | Strips off the special characters.
-
getStem
public String getStem(String token)
- Gets the stem of the input String. It calls getWordAndResolveCase(), which in turn calls getWord(). During indexing, it is called for every token appearing in the XML files. At retrieval, it is called only when a word without "*" is used inside the in() function.
getWordAndResolveCase
public String getWordAndResolveCase(String token)
- Strips off the special characters and resolves the case. It is called directly when a word with "*" is used inside the in() function at retrieval.
-
getWord
protected String getWord(String token)
- Strips off the special characters from a given token.
Your customized stemmer need not override this method, but you have to write similar code in getWordAndResolveCase() instead.
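As an illustration of writing the character filtering directly in getWordAndResolveCase(), the sketch below keeps hyphens in addition to letters and digits. HyphenKeepingStemmer is a hypothetical name, the hyphen rule is only an example, and the Stemmer class is again a local stand-in for com.futurexpert.xpert.xindex.Stemmer so that the code compiles on its own.

```java
// Stand-in for com.futurexpert.xpert.xindex.Stemmer; a real project would
// import the XPERT class instead of declaring this one.
class Stemmer {
    public String getStem(String str) { return getWordAndResolveCase(str); }
    public String getWordAndResolveCase(String str) { return getWord(str).toLowerCase(); }
    protected String getWord(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            if (Character.isLetterOrDigit(str.charAt(i))) sb.append(str.charAt(i));
        }
        return sb.toString();
    }
}

// Hypothetical subclass that does its own filtering instead of using getWord().
class HyphenKeepingStemmer extends Stemmer {
    @Override
    public String getWordAndResolveCase(String token) {
        StringBuilder kept = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            // Same test as getWord(), plus an exception for '-'.
            if (Character.isLetterOrDigit(c) || c == '-') {
                kept.append(c);
            }
        }
        return kept.toString().toLowerCase();
    }

    @Override
    public String getStem(String token) {
        // No stemming rule in this sketch; one would be applied to the cleaned word.
        return getWordAndResolveCase(token);
    }
}
```

With this sketch, new HyphenKeepingStemmer().getWordAndResolveCase("E-mail,") yields "e-mail", whereas the stand-in base class yields "email" for the same token.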