com.futurexpert.xpert.xindex
Class Stemmer

java.lang.Object
  |
  +--com.futurexpert.xpert.xindex.Stemmer


public class Stemmer
extends Object
The Stemmer class is the root class of all customized stemming classes. It has three methods: getStem(), getWordAndResolveCase(), and getWord(). The first two are public, whereas the last one is protected. getStem() returns the stem of a token. It calls getWordAndResolveCase(), which calls getWord() to strip the special characters from the input token and then converts the result to lowercase. To define your own stemmer class, override the two public methods with the behavior you need. The Stemmer class is defined as follows.

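  public class Stemmer {

      // Returns the stem of a token. The base class performs no real stemming.
      public String getStem(String str) {
          return getWordAndResolveCase(str);
      }

      // Strips the special characters, then converts the token to lowercase.
      public String getWordAndResolveCase(String str) {
          return getWord(str).toLowerCase();
      }

      // Keeps only the letters and digits of the token.
      protected String getWord(String str) {
          StringBuilder temp = new StringBuilder();

          // Character.isLetterOrDigit() is static, so no Character instance
          // is needed and an empty token no longer fails on charAt(0).
          for (int i = 0; i < str.length(); i++) {
              char c = str.charAt(i);
              if (Character.isLetterOrDigit(c))
                  temp.append(c);
          }

          return temp.toString();
      }
  } // class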
 
The base Stemmer does not perform any stemming; it only strips the special characters from the input token and resolves its case. A token is a sequence of characters that contains no whitespace. Note that Stemmer is not an abstract class, because it is the stemmer that is invoked when you pass null to the setStemClass() method of Xindex or Xretrieve.
You may wonder why your stemmer class needs to define getWordAndResolveCase() as well as getStem(). The reason is that XPERT supports the wildcard character ("*"). When a query uses a word without the wildcard, as in "//SECTION[in("XML", PARA)]", XPERT calls getStem(). When a query uses the wildcard inside the in() function, as in "//SECTION[in("infor*", PARA)]", XPERT calls getWordAndResolveCase() on the prefix "infor" instead of getStem(). Calling getStem() on "infor" could return an unwanted stem, because a prefix is not a complete word and stemming it is meaningless.
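As a rough illustration of why the two methods are kept separate, the sketch below subclasses a stand-in copy of Stemmer. The class name SuffixStemmer and its naive suffix rule are hypothetical and exist only for this example; the stand-in Stemmer is included so that the snippet compiles without the XPERT library.

```java
// Stand-in for com.futurexpert.xpert.xindex.Stemmer so the sketch compiles alone.
class Stemmer {
    public String getStem(String str) { return getWordAndResolveCase(str); }
    public String getWordAndResolveCase(String str) { return getWord(str).toLowerCase(); }
    protected String getWord(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (Character.isLetterOrDigit(c)) sb.append(c);
        }
        return sb.toString();
    }
}

// Hypothetical customized stemmer: removes a trailing "ing" or "s".
// The rule is deliberately naive and is not a real stemming algorithm.
class SuffixStemmer extends Stemmer {
    @Override
    public String getStem(String token) {
        String word = getWordAndResolveCase(token);
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Normalizes only. The suffix rule is never applied here, so a
    // wildcard prefix is left intact.
    @Override
    public String getWordAndResolveCase(String token) {
        return super.getWordAndResolveCase(token);
    }
}
```

With this rule, getStem("Thes") returns "the", which would be the wrong prefix for a query such as "thes*", whereas getWordAndResolveCase("Thes") returns "thes".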
In summary, during indexing XPERT always calls getStem() for a token, because a "*" that occurs in the text is just the literal ASCII character and does not mean "any sequence of characters". During retrieval, XPERT calls getStem() for a word without "*" and calls getWordAndResolveCase(), which in turn calls getWord(), for a word with "*".
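The dispatch rule described above can be sketched as follows. This is not XPERT's actual code, which this page does not show: the class TermNormalizer and its methods are hypothetical, and the sketch assumes that only the text before the first "*" is treated as the prefix. The stand-in Stemmer is included so that the snippet compiles on its own.

```java
// Stand-in for com.futurexpert.xpert.xindex.Stemmer so the sketch compiles alone.
class Stemmer {
    public String getStem(String str) { return getWordAndResolveCase(str); }
    public String getWordAndResolveCase(String str) { return getWord(str).toLowerCase(); }
    protected String getWord(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (Character.isLetterOrDigit(c)) sb.append(c);
        }
        return sb.toString();
    }
}

// Hypothetical helper that mirrors the rule stated in the text.
class TermNormalizer {

    // Indexing: every token goes through getStem(); "*" is ordinary text.
    static String forIndexing(Stemmer stemmer, String token) {
        return stemmer.getStem(token);
    }

    // Retrieval: a word with "*" is only normalized, never stemmed.
    static String forRetrieval(Stemmer stemmer, String word) {
        int star = word.indexOf('*');
        if (star < 0) {
            return stemmer.getStem(word);
        }
        // Assumption: the prefix is everything before the first "*".
        return stemmer.getWordAndResolveCase(word.substring(0, star));
    }
}
```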
At present, XPERT does not support UTF-8 or other Unicode encodings, so you may not be able to plug in a stemmer class for a language other than English. Support for UTF-8 and other encodings will be added in a later version.
 


   

Method Summary

String

getStem(String token)
          Gets the stem of the input String. It calls getWordAndResolveCase(), which in turn calls getWord().

 String

getWordAndResolveCase(String token)

Strips the special characters and resolves the case.

protected String

getWord(String token)

Strips the special characters.

   

 

Method Detail

getStem


public String getStem(String token)
Gets the stem of the input String. It calls getWordAndResolveCase(), which in turn calls getWord(). During indexing, it is called for every token that appears in the XML files. During retrieval, it is called only when a word without "*" is used inside the in() function.


getWordAndResolveCase


public String getWordAndResolveCase(String token)
Strips the special characters and resolves the case. During retrieval, it is called directly when a word with "*" is used inside the in() function.
 


getWord


protected String getWord(String token)
Strips the special characters from a given token. Your customized stemmer does not need to override this method; instead, write similar code in your own getWordAndResolveCase().
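A minimal sketch of that advice, assuming a stemmer that wants to keep hyphens inside tokens: the class HyphenKeepingStemmer is hypothetical, and the stand-in Stemmer is included only so that the snippet compiles without the XPERT library. The character filtering is written directly in getWordAndResolveCase() and getWord() is left untouched.

```java
// Stand-in for com.futurexpert.xpert.xindex.Stemmer so the sketch compiles alone.
class Stemmer {
    public String getStem(String str) { return getWordAndResolveCase(str); }
    public String getWordAndResolveCase(String str) { return getWord(str).toLowerCase(); }
    protected String getWord(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (Character.isLetterOrDigit(c)) sb.append(c);
        }
        return sb.toString();
    }
}

// Hypothetical stemmer that keeps '-' in addition to letters and digits.
class HyphenKeepingStemmer extends Stemmer {
    @Override
    public String getWordAndResolveCase(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            // Same filtering idea as getWord(), with one extra allowed character.
            if (Character.isLetterOrDigit(c) || c == '-') sb.append(c);
        }
        return sb.toString().toLowerCase();
    }
}
```

Because the inherited getStem() calls getWordAndResolveCase(), the override affects both methods in this sketch.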