com.futurexpert.xpert.xindex
Class XindexSAX

java.lang.Object
  |
  +--com.futurexpert.xpert.xindex.XindexSAX


public class XindexSAX
extends Object
The XindexSAX class is the primary indexing class for use with SAX-2 (SAX version 2) parsers. You index XML documents by creating an XindexSAX object. One advantage of this class over XindexDOM is that you can use any SAX-2 parser written in Java, not only Xerces. Another is that, because it does not load the whole XML file into memory, it can process files of any size. XindexSAX is also faster than XindexDOM, so it is highly recommended that you use XindexSAX instead of XindexDOM. As with XindexDOM, you create an XindexSAX object as follows:
	XindexSAX xin = new XindexSAX();
You can then associate your SAX-2 parser with the object using either of the following calls:
	xin.setSAXParser("org.apache.xerces.parsers.SAXParser");
or
	xin.setSAXParser(xmlReader);  // xmlReader is an org.xml.sax.XMLReader
With the Sun SAX parser, you may not be able to pass a string naming the parser to setSAXParser(String saxparser). In that case, obtain an XMLReader with the following code and call setSAXParser(XMLReader xmlReader):
	javax.xml.parsers.SAXParserFactory saxParserFactory = javax.xml.parsers.SAXParserFactory.newInstance();
	javax.xml.parsers.SAXParser saxParser = saxParserFactory.newSAXParser();
	org.xml.sax.XMLReader xmlReader = saxParser.getXMLReader();
	xin.setSAXParser(xmlReader);
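
To see what the XMLReader obtained this way does on its own, here is a minimal sketch that uses only the standard JAXP and SAX classes. The class name SaxReaderSketch and the element-counting handler are illustrative assumptions and are not part of the Xindex API; the sketch only shows that the reader delivers events one at a time instead of building the whole document in memory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Xindex.
final class SaxReaderSketch {

    /** Creates a SAX-2 XMLReader through JAXP, following the snippet above. */
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        return factory.newSAXParser().getXMLReader();
    }

    /** Counts start tags, one callback per element, without keeping the document. */
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        XMLReader reader = newReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Elements a, b and c: prints 3
        System.out.println(countElements("<a><b/><c>text</c></a>"));
    }
}
```

A reader created by newReader() is the kind of object that setSAXParser(XMLReader xmlReader) expects.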

You can add an XML file or directory to the collection by calling addDocuments(). For instance, you can add as many XML files as you want by calling addDocuments() repeatedly:
	xin.addDocuments("one.xml");
	xin.addDocuments("two.xml");
	xin.addDocuments("three.xml");
	xin.index();
The index() method indexes the list of XML files or directories added by addDocuments().
You can index multiple files by creating a meta-file that lists the XML files. The name of the meta-file should end with "xmm" instead of "xml". For instance, if you write:
	xin.addDocuments("xmlfiles.xmm");
	xin.index();
the indexer reads the names of the XML files from "xmlfiles.xmm" one by one and tries to index them. In "xmlfiles.xmm", the XML file names should be separated by spaces or newlines, as in:
	a.xml b.xml c.xml
	d.xml
	e.xml …
The file names in "xmlfiles.xmm" can be either relative or absolute. If a file is given with a relative name, the path of the meta-file's directory is prepended to the file name. For instance, if "xmlfiles.xmm" is in "/shin/xml", then the full paths of the files are assumed to be "/shin/xml/a.xml", "/shin/xml/b.xml", and so on.
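
The path rule above can be sketched as follows. This is not the indexer's own code: the class MetaFileSketch and the method resolveEntries are hypothetical names, and the sketch assumes Unix-style paths as used in the text.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Xindex.
final class MetaFileSketch {

    /**
     * Splits the meta-file text on spaces and newlines and resolves each
     * relative entry against the directory that contains the meta-file.
     * Absolute entries are kept unchanged.
     */
    static List<String> resolveEntries(String metaFilePath, String metaFileText) {
        Path parent = Paths.get(metaFilePath).getParent();
        List<String> resolved = new ArrayList<>();
        for (String entry : metaFileText.trim().split("\\s+")) {
            if (entry.isEmpty()) {
                continue; // empty meta-file
            }
            Path p = Paths.get(entry);
            if (!p.isAbsolute() && parent != null) {
                p = parent.resolve(p);
            }
            resolved.add(p.toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        String text = "a.xml b.xml c.xml\nd.xml\n/other/e.xml";
        resolveEntries("/shin/xml/xmlfiles.xmm", text).forEach(System.out::println);
    }
}
```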
You can also add a directory to the collection. If you add a directory as:
	xin.addDocuments("/xml/adir");
all the XML files in /xml/adir are indexed one by one.
Sometimes you may want to index the XML files recursively, down through the sub-directories. You can do that with an overloaded method:
	xin.addDocuments("/xml/adir", "r");
You can also control how many levels of sub-directories are indexed, using one more overloaded method. For instance, if you write
	xin.addDocuments("/xml/adir", "r", 2);
you will index the directory /xml/adir and its sub-directories up to two levels below it. For instance, if the directory structure is /xml/adir/bdir/cdir/ddir, then all the files whose names end with ".xml" are indexed if they are in /xml/adir, /xml/adir/bdir, or /xml/adir/bdir/cdir.
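
The level rule can be reproduced with the standard java.nio.file API. The sketch below is an illustration under assumptions, not the indexer's implementation: DepthLimitedScan and xmlFiles are hypothetical names, and the mapping from level to the depth passed to Files.walk follows the /xml/adir example above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of Xindex.
final class DepthLimitedScan {

    /** Lists *.xml files in dir and in sub-directories up to level levels below it. */
    static List<String> xmlFiles(Path dir, int level) {
        // Files.walk treats the start directory as depth 0 and its direct
        // entries as depth 1, so files that sit level directories further
        // down are found at depth level + 1.
        try (Stream<Path> paths = Files.walk(dir, level + 1)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".xml"))
                        .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a directory as the first argument, for example /xml/adir.
        xmlFiles(Path.of(args[0]), 2).forEach(System.out::println);
    }
}
```

With level 2 and the structure adir/bdir/cdir/ddir, files in adir, bdir and cdir are listed, and files in ddir are not.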
The index files can also be placed flexibly. From version 0.4 on, you can place the index files wherever you want, under an arbitrary name. In version 0.3 and earlier, the index files were created in the parent directory of the directory containing the XML files, and the collection name was the same as the directory name. For instance, if you use the setIndexLoc() method as below
	xin.setIndexLoc("/xml/index", "myxml");
the index files are created in the directory /xml/index under the name myxml. Specifically, ten index files, myxml.1 to myxml.9 and myxml.11, are created, together with an XML source file named myxml.coll that gathers all the XML content.
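
As a quick check after indexing, the file names listed above can be generated and compared with the contents of the index directory. The class IndexFileNames and the method expected are hypothetical names; the suffixes are taken from the text, including the fact that it names .11 and not .10.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Xindex.
final class IndexFileNames {

    /** Lists the file paths described above for a directory and an index name. */
    static List<String> expected(String dir, String indexName) {
        String base = dir.endsWith("/") ? dir + indexName : dir + "/" + indexName;
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= 9; i++) {
            names.add(base + "." + i);
        }
        names.add(base + ".11");   // the text names .11, not .10
        names.add(base + ".coll"); // collection file holding the XML content
        return names;
    }

    public static void main(String[] args) {
        expected("/xml/index", "myxml").forEach(System.out::println);
    }
}
```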
Here are some more examples of options that can be given to the class:
	xin.setStemmingClass("com.futurexpert.xpert.xindex.Stemmer");
	xin.setStopWordFile("stopword");
	xin.setIndexUnitSize(500000);
	
The XindexSAX class provides flexibility in indexing by letting users choose their own stemming algorithm and stop word file. If you give a stemming class (with its fully qualified class name) to the setStemmingClass() method, you can use your own stemming algorithm instead of the built-in stemmer. If you pass null to the method as:
	xin.setStemmingClass(null);
the stemmer class "com.futurexpert.xpert.xindex.Stemmer" is used, which does not perform stemming at all; it only converts uppercase letters to lowercase and strips special characters from a given token. If you do not call the method in your program, the default Porter stemmer "com.futurexpert.xpert.xindex.Porter" is installed.
You can also choose your own stop words instead of the default ones by using the setStopWordFile() method. A stop word file is a list of stop words separated by at least one space. Stop words are common words that do not carry significant meaning and are therefore excluded from the index. For instance, if you include the words "a as for of" in the stop word file, no occurrence of those words is indexed. The more stop words are listed, the fewer terms are indexed, which reduces the index size.
Having more stop words has pros and cons. One advantage is that the index takes less space. At the same time, because the words in the stop word file are not indexed, you cannot retrieve elements by those words. In some cases this causes trouble for phrase retrieval. For instance, if a user wants to retrieve "XML management system" but the word "system" is in your stop word file, the elements cannot be retrieved even though they contain the phrase.
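
The effect of token normalization and stop words can be sketched in a few lines. Everything below is an illustration under assumptions, not the Xindex implementation: the class TokenFilterSketch and its methods are hypothetical, and "special characters" is assumed to mean anything that is not a letter or a digit.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of Xindex.
final class TokenFilterSketch {

    /** Lowercases a token and drops characters that are not letters or digits (assumed rule). */
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    /** Parses the body of a stop word file: words separated by whitespace. */
    static Set<String> parseStopWords(String fileText) {
        Set<String> words = new HashSet<>();
        for (String w : fileText.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(normalize(w));
            }
        }
        return words;
    }

    /** Returns the terms that remain to be indexed once stop words are dropped. */
    static List<String> indexedTerms(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String t = normalize(raw);
            if (!t.isEmpty() && !stopWords.contains(t)) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** A phrase can only be matched if none of its words is a stop word. */
    static boolean phraseSearchable(String phrase, Set<String> stopWords) {
        for (String raw : phrase.trim().split("\\s+")) {
            if (stopWords.contains(normalize(raw))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> stop = parseStopWords("a as for of");
        System.out.println(indexedTerms("A Guide, for users of XML!", stop));
        System.out.println(phraseSearchable("XML management system", parseStopWords("system")));
    }
}
```

The second call in main shows the trade-off described above: once "system" is a stop word, the phrase "XML management system" can no longer be matched as a whole.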
Another method, setIndexUnitSize(), sets the limit on the total size of files that are indexed together. For instance, if you set the size to 500,000 bytes as:
	xin.setIndexUnitSize(500000);
the indexer indexes together in memory the files whose accumulated size does not exceed 500K, and then compresses the indices to disk. So if the total size of the files is 10 megabytes, the indexer compresses a group of files and saves it to disk 20 times. The indices are merged together periodically in the course of indexing. The larger the value you pass, the more of the index tends to reside in memory, which may cause an out-of-memory problem. The default is 500,000 bytes (500K). The recommended size is from 500K to 1 megabyte.
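
The grouping arithmetic can be sketched as a simple greedy pass over the file sizes. This is an assumption about the behavior for illustration only: IndexUnitSketch and group are hypothetical names, and a file larger than the limit is simply placed in a group of its own here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Xindex.
final class IndexUnitSketch {

    /** Groups file sizes in order so that each group's total stays within the limit. */
    static List<List<Long>> group(long[] fileSizes, long limit) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : fileSizes) {
            // Close the current group when adding this file would exceed the limit.
            if (!current.isEmpty() && total + size > limit) {
                groups.add(current);
                current = new ArrayList<>();
                total = 0;
            }
            current.add(size);
            total += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Twenty files of 500,000 bytes each (10 megabytes in total).
        long[] sizes = new long[20];
        Arrays.fill(sizes, 500_000L);
        // prints 20
        System.out.println(group(sizes, 500_000L).size());
    }
}
```

Each group corresponds to one compress-and-save step, which matches the count of 20 in the 10 megabyte example above.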
One advantage of XindexSAX over XindexDOM is that if a file's size exceeds the limit, XindexSAX splits the file into pieces of the limit size and indexes each separately, whereas XindexDOM tries to index the whole file as it is. Hence, XindexSAX does not run into memory problems when indexing large files.


 

Constructor Summary

XindexSAX( )

 Initialize the indexing object

 

Method Summary

 void

setSAXParser(String SAXParser)
           Associates the SAX parser named by the argument string with the XindexSAX object.

 void

setSAXParser(XMLReader xmlreader)
           Associates the SAX parser represented by the XMLReader argument with the XindexSAX object.

 void

addDocuments(String file)
          Add the XML file (or directory) named by file to the collection. The file (or the files in the directory) will be indexed later, together with the other added files, by the index() method.

 void

addDocuments(String dir, String recur)
          Add the XML directory to the collection. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories will be indexed.

 void

addDocuments(String dir, String recur, int level)
          Add the XML directory to the collection with a level limit. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories down to level level will be indexed.

 void

setIndexLoc(String dir, String indexname)
           Place the index files in the directory dir with the name indexname

 void

setStemmingClass(Stemmer stem)
          Set the customized stemming class. If it is set to null, the class Stemmer is used.

 void

setStopWordFile(String filename)
          Set the customized stop word file.

 void

setIndexUnitSize(int size)
          Set the limit of the accumulated file sizes in bytes that are indexed together in memory.

void

index()
          Perform indexing depending on the parameters given above.

   

Constructor Detail

XindexSAX( )

public XindexSAX()
Initializes a new indexing object.


 

Method Detail

setSAXParser

public void setSAXParser(String file)
setSAXParser() associates the SAX parser named by the argument string with the XindexSAX object. For instance, if you write:
	xin.setSAXParser("org.apache.xerces.parsers.SAXParser");
XindexSAX invokes the Apache Xerces SAX parser and tries to index with the parser.


setSAXParser

public void setSAXParser(XMLReader xmlreader)
setSAXParser() associates the SAX parser represented by the XMLReader argument with the XindexSAX object. With the Sun SAX parser, you may not be able to pass the string. Instead, you may have to write the code as:
	javax.xml.parsers.SAXParserFactory saxParserFactory = javax.xml.parsers.SAXParserFactory.newInstance();
	javax.xml.parsers.SAXParser saxParser = saxParserFactory.newSAXParser();
	org.xml.sax.XMLReader xmlReader = saxParser.getXMLReader();
	xin.setSAXParser(xmlReader);
XindexSAX invokes the SUN Javax SAX parser and tries to index with the parser.


addDocuments

public void addDocuments(String file)
Add the XML file (or directory) to the collection. The file (or the files in the directory) will be indexed later, together with the other added files, by the index() method.


addDocuments

public void addDocuments(String dir, String recur)
Add the XML directory to the collection. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories will be indexed.


addDocuments

public void addDocuments(String dir, String recur, int level)
Add the XML directory to the collection with a level limit. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories down to level level will be indexed.


setIndexLoc

public void setIndexLoc(String dir, String name)
Place the index files in the directory dir under the index name given by name.


setStemmingClass

public void setStemmingClass(Stemmer stemclass)
Set the customized stemming class. The argument class should be a subclass of Stemmer.
Parameters:
stemclass - a stemming class that is a subclass of Stemmer.
 


setStopWordFile

public void setStopWordFile(String stopfilename)
Set the customized stop word file.
Parameters:
stopfilename - the name of the stop word file


setIndexUnitSize

public void setIndexUnitSize(int size)
Set the limit on the accumulated size of files that are indexed together in memory. The default is 500,000 bytes (500K). With the default, the indexer indexes together in memory the files whose accumulated size does not exceed 500K, and then compresses the indices to disk.
So if the total size of the files is 10 megabytes, the indexer compresses a group of files and saves it to disk 20 times. The indices are merged together periodically in the course of indexing. The larger the value you pass, the more of the index tends to reside in memory, which may cause an out-of-memory problem. The recommended size is from 500K to 1 megabyte. If a file's size exceeds the limit, it is indexed alone and merged later with other files.
 


index

public void index()
Perform the indexing according to the parameters. (*) Make sure that you have write permission to the directory where the file is located. As the result of indexing, you will get 9 index files and one collection file that holds all the XML contents (with entities expanded). For instance, if you index a file "nt.xml", you will get index files "nt.xml.1" to "nt.xml.9" and "nt.xml.coll". The purpose of the collection file is to gather the XML files with all entities expanded and to provide the element contents that are retrieved.
As the result of indexing you will also get ten index files and one collection file as: "xmllist.1" to "xmllist.9", "xmllist.11" and "xmllist.coll".
If a directory was added instead of a file, index() reads all the XML files with the extension ".xml" in the directory and indexes them together. As in the file case, the index files are named "xmldir.1" to "xmldir.9", "xmldir.11" and "xmldir.coll" when the name of the directory is "xmldir".