com.futurexpert.xpert.xindex
Class XindexDOM
java.lang.Object
|
+--com.futurexpert.xpert.xindex.XindexDOM
- public class XindexDOM extends Object
The XindexDOM class is the primary indexing class. It uses the Apache Xerces DOM parser. To index XML documents, first create an XindexDOM object. A simple example is:
XindexDOM xin = new XindexDOM();
You can add an XML file or a directory to the collection by calling the addDocuments() methods.
For instance, you can add as many XML files as you want by calling addDocuments() repeatedly:
xin.addDocuments("one.xml");
xin.addDocuments("two.xml");
xin.addDocuments("three.xml");
xin.index();
- The index() method indexes the XML files and directories added by addDocuments().
- You can also index multiple files by creating a meta-file that lists the XML files.
The name of a meta-file should end with "xmm" instead of "xml". For instance, if you write:
xin.addDocuments("xmlfiles.xmm");
xin.index();
- the indexer reads the names of the XML files from "xmlfiles.xmm" one by one and tries to index them.
In "xmlfiles.xmm", the XML file names should be separated by spaces or newlines, as in:
a.xml b.xml c.xml
d.xml
e.xml …
- The file names in "xmlfiles.xmm" can be either relative or absolute.
If a file is given with a relative name, the path of the meta-file's directory is prepended to the file name. For instance, if "xmlfiles.xmm" is in the directory "/shin/xml", then the full paths of the files are assumed to be "/shin/xml/a.xml", "/shin/xml/b.xml" and so on.
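The meta-file rules above (names separated by spaces or newlines, relative names resolved against the meta-file's directory) can be sketched as follows. This is only an illustration and not the code XindexDOM uses: the class MetaFileSketch and its resolve() methods are hypothetical names, and reading the file as UTF-8 is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the xindex package.
final class MetaFileSketch {

    // Split the meta-file text on spaces/newlines and resolve every
    // relative name against the directory that holds the meta-file.
    static List<Path> resolve(Path metaFile, String content) {
        Path baseDir = metaFile.toAbsolutePath().getParent();
        List<Path> result = new ArrayList<>();
        for (String name : content.trim().split("\\s+")) {
            if (name.isEmpty()) {
                continue; // the meta-file was empty
            }
            Path p = Paths.get(name);
            // Absolute names are kept; relative names get the directory prepended.
            result.add(p.isAbsolute() ? p : baseDir.resolve(p).normalize());
        }
        return result;
    }

    // Convenience overload that reads the meta-file from disk (UTF-8 assumed).
    static List<Path> resolve(Path metaFile) throws IOException {
        byte[] bytes = Files.readAllBytes(metaFile);
        return resolve(metaFile, new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Path meta = Paths.get("/shin/xml/xmlfiles.xmm");
        for (Path p : resolve(meta, "a.xml b.xml c.xml\nd.xml\ne.xml")) {
            System.out.println(p);
        }
    }
}
```

Passing the content as a string keeps the path logic separate from file access, so it can be tried without creating a meta-file on disk.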
You can also add a directory to the collection. If you add a directory as:
xin.addDocuments("/xml/adir");
- all the XML files in /xml/adir are indexed one by one.
Sometimes you may want to index the XML files recursively, down into the sub-directories. You can do that with an overloaded method:
xin.addDocuments("/xml/adir", "r");
- You can also control how many levels of sub-directories are indexed.
There is another overloaded method for this. For instance, if you write
xin.addDocuments("/xml/adir", "r", 2);
you will index the directory /xml/adir and its sub-directories up to two levels below it. For instance, if the directory structure is /xml/adir/bdir/cdir/ddir, then all the files whose names end with ".xml" are indexed if they are in /xml/adir, /xml/adir/bdir or /xml/adir/bdir/cdir.
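To make the meaning of the level argument concrete, here is a minimal sketch of a depth-limited listing built on the standard java.nio.file API. It is not the XindexDOM implementation; the class DepthLimitedListing and the method xmlFiles() are hypothetical names, and the mapping of level to a walk depth follows the /xml/adir example above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of the xindex package.
final class DepthLimitedListing {

    // level = number of sub-directory levels below dir to include.
    // Files.walk counts dir itself as depth 0 and the files directly
    // inside it as depth 1, so files that sit `level` directories
    // further down are found at depth level + 1.
    static List<Path> xmlFiles(Path dir, int level) throws IOException {
        try (Stream<Path> walk = Files.walk(dir, level + 1)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> p.getFileName().toString().endsWith(".xml"))
                       .sorted()
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DepthLimitedListing /xml/adir 2
        Path dir = java.nio.file.Paths.get(args[0]);
        int level = Integer.parseInt(args[1]);
        xmlFiles(dir, level).forEach(System.out::println);
    }
}
```

With level 2 and the structure /xml/adir/bdir/cdir/ddir, the sketch returns the ".xml" files in adir, bdir and cdir but not those in ddir.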
The location of the index files is also flexible. In version 0.3 and earlier, the index files were created in the parent directory of the directory where the XML files reside, and the collection name was the same as the directory name. From version 0.4, you can place the index files in any directory under any name. For instance, if you use the method setIndexLoc() as below
xin.setIndexLoc("/xml/index", "myxml");
- the index files are created in the directory /xml/index under the name myxml. In fact, ten index files, myxml.1 to myxml.9 and myxml.11, are created, together with an XML source file named myxml.coll that puts all the XML content together.
- Here are some more examples of the options that can be given to the class:
xin.setStemmingClass("com.futurexpert.xpert.xindex.Stemmer");
xin.setStopWordFile("stopword");
xin.setIndexUnitSize(500000);
- The XindexDOM class provides flexibility in indexing by letting users choose their own stemming algorithm and stop word file. If you give a stemming class to the method setStemmingClass() (give the fully qualified name of the class), your own stemming algorithm is used instead of the built-in stemmer. If you pass null to the method, as in:
xin.setStemmingClass(null);
- the stemmer class "com.futurexpert.xpert.xindex.Stemmer" is used, which does not perform stemming at all; it only changes uppercase letters into lowercase ones and strips the special characters from a given token. If you don't call the method in your program, the default Porter stemmer "com.futurexpert.xpert.xindex.Porter" is installed.
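The normalization attributed to the Stemmer class above (lowercasing and stripping special characters, with no stemming) might look like the sketch below. It does not reproduce the real Stemmer class or its interface: NormalizeOnlySketch and normalize() are hypothetical names, and treating every character other than a letter or digit as "special" is an assumption.

```java
import java.util.Locale;

// Hypothetical illustration; not the com.futurexpert Stemmer class.
final class NormalizeOnlySketch {

    // Lowercase the token and drop every character that is not a
    // letter or a digit. No suffix stripping (stemming) is performed.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("XML-Based!")); // prints "xmlbased"
        System.out.println(normalize("Indexing"));   // prints "indexing"
    }
}
```

A Porter stemmer, by contrast, would also reduce a word such as "indexing" to a shorter stem, so the choice of stemmer changes which query terms match.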
You can also choose your own stop words instead of the default ones by using the method setStopWordFile(). A stop word file is a list of stop words separated by at least one space. Stop words are common words that do not carry significant meaning and are therefore excluded from the index. For instance, if you include the words "a as for of" in the stop word file, no occurrence of those words is indexed. The more stop words are listed, the fewer terms are indexed, which reduces the index size.
There are pros and cons to having more stop words. One advantage is a smaller index. At the same time, remember that because the words in the stop word file are not indexed, you cannot retrieve elements by those words. In some cases this causes trouble for phrase retrieval. For instance, if a user wants to retrieve "XML management system" but the word "system" is in your stop word file, the elements cannot be retrieved even though they contain the phrase.
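The trade-off can be illustrated with a small, self-contained sketch. It is not the indexer's own code: StopWordSketch and its methods are hypothetical, and splitting text on whitespace and comparing words in lowercase are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the XindexDOM implementation.
final class StopWordSketch {

    // Parse the content of a stop word file: words separated by whitespace.
    static Set<String> parse(String content) {
        Set<String> words = new HashSet<>();
        for (String w : content.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    // Keep only the terms that would end up in the index.
    static List<String> indexTerms(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!w.isEmpty() && !stopWords.contains(w)) {
                terms.add(w);
            }
        }
        return terms;
    }

    // A phrase can be matched word for word only if none of its words
    // was dropped as a stop word.
    static boolean phraseSearchable(String phrase, Set<String> stopWords) {
        int wordCount = phrase.trim().split("\\s+").length;
        return indexTerms(phrase, stopWords).size() == wordCount;
    }

    public static void main(String[] args) {
        Set<String> stop = parse("a as for of system");
        System.out.println(indexTerms("A history of XML", stop));              // [history, xml]
        System.out.println(phraseSearchable("XML management system", stop));   // false
    }
}
```

Adding "system" to the stop word list shrinks the index but makes the phrase "XML management system" unsearchable, which is the cost described above.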
Another method, setIndexUnitSize(), sets the limit on the total size of the files that are indexed together. For instance, if you set the size to 500,000 bytes as:
xin.setIndexUnitSize(500000);
- the indexer indexes files together in memory as long as their accumulated size does not exceed 500K, then compresses the indices and writes them to disk. So if the total size of the files is 10 megabytes, the indexer compresses a group of files and saves it to disk about 20 times (10 megabytes / 500K). The indices are merged together periodically in the course of indexing.
The larger the value you pass, the more index data tends to reside in memory, which may cause an out-of-memory problem. The default is 500,000 bytes (500K). The recommended size is from 500K to 1 megabyte.
If the size of a single file exceeds the limit, that file is indexed alone and merged later with the other files.
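The grouping rule can be sketched as a greedy pass over the file sizes. This is an illustration under assumptions, not the indexer's actual algorithm: IndexUnitSketch and group() are hypothetical names, and the sketch assumes files are grouped in the order they were added.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not the XindexDOM implementation.
final class IndexUnitSketch {

    // Put consecutive files into one group as long as the accumulated
    // size stays within the limit. A file larger than the limit ends up
    // in a group of its own.
    static List<List<Long>> group(List<Long> fileSizes, long limit) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && total + size > limit) {
                groups.add(current); // flush: this group would be compressed to disk
                current = new ArrayList<>();
                total = 0;
            }
            current.add(size);
            total += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 100 files of 100,000 bytes each = 10 megabytes in total.
        List<Long> sizes = Collections.nCopies(100, 100_000L);
        System.out.println(group(sizes, 500_000L).size()); // prints 20
    }
}
```

The figure of 20 holds when the files fill each group exactly. With uneven file sizes the number of groups, and therefore of disk writes, can be higher.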
Constructor Summary
XindexDOM()
- Initializes the indexing object.
Method Summary
void addDocuments(String file)
- Adds the XML file (or directory) to the collection. The file (or the files in the directory) is indexed later, together with the other added files, when index() is called.
void addDocuments(String dir, String recur)
- Adds the XML directory to the collection. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories are indexed.
void addDocuments(String dir, String recur, int level)
- Adds the XML directory to the collection with a depth limit. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories down to level levels are indexed.
void setIndexLoc(String dir, String indexname)
- Places the index files in the directory dir under the name indexname.
void setStemmingClass(Stemmer stem)
- Sets the customized stemming class. If it is set to null, the class Stemmer is used.
void setStopWordFile(String filename)
- Sets the customized stop word file.
void setIndexUnitSize(int size)
- Sets the limit, in bytes, on the accumulated size of the files that are indexed together in memory.
void index()
- Performs indexing according to the parameters given above.
XindexDOM
public XindexDOM()
- Initializes a new indexing object.
addDocuments
public void addDocuments(String file)
- Adds the XML file (or directory) to the collection. The file (or the files in the directory) is indexed later, together with the other added files, when index() is called.
addDocuments
public void addDocuments(String dir, String recur)
- Adds the XML directory to the collection. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories are indexed.
addDocuments
public void addDocuments(String dir, String recur, int level)
- Adds the XML directory to the collection with a depth limit. If recur is equal to "r", all the XML files in the directory itself and in its sub-directories down to level levels are indexed.
setIndexLoc
public void setIndexLoc(String dir, String name)
- Places the index files in the directory dir under the index name given by name.
setStemmingClass
public void setStemmingClass(Stemmer stemclass)
- Sets the customized stemming class. The argument class should be a subclass of Stemmer.
- Parameters:
stemclass - a stemming class that is a subclass of Stemmer.
setStopWordFile
public void setStopWordFile(String stopfilename)
- Sets the customized stop word file.
- Parameters:
stopfilename - the name of the stop word file
setIndexUnitSize
public void setIndexUnitSize(int size)
- Sets the limit on the accumulated size of the files that are indexed together in memory. The default is 500,000 bytes (500K). With the default, the indexer indexes files together in memory as long as their accumulated size does not exceed 500K, then compresses the indices and writes them to disk.
- So if the total size of the files is 10 megabytes, the indexer compresses a group of files and saves it to disk about 20 times. The indices are merged together periodically in the course of indexing. The larger the value you pass, the more index data tends to reside in memory, which may cause an out-of-memory problem. The recommended size is from 500K to 1 megabyte. If the size of a single file exceeds the limit, that file is indexed alone and merged later with the other files.
index
public void index()
- Performs the indexing according to the parameters. Note: make sure that you have write permission for the directory where the file is located. As the result of indexing, you will get 9 index files and one collection file that has all the XML contents (with entities expanded). For instance, if you index a file "nt.xml", you will get index files from "nt.xml.1" to "nt.xml.9" and "nt.xml.coll".
The purpose of the collection file is to gather the XML files with all the entities expanded and to provide the element contents that are retrieved.
As the result of indexing, you will also get ten index files and one collection file, as in "xmllist.1" to "xmllist.9", "xmllist.11" and "xmllist.coll".
- If the argument given to addDocuments() is a directory name instead of a file name, index() reads all the XML files with the extension ".xml" in the directory and indexes them together.
As in the file case, the index files are named "xmldir.1" to "xmldir.9", "xmldir.11" and "xmldir.coll" when the name of the directory is "xmldir".