A communiqué from Eliot Kimber announces the availability of a LuceneXML package and a companion LuceneClient package, which support indexing of XML documents in a way that enables structure-aware search and retrieval. LuceneXML represents "the initial result of an experiment in using the Apache Lucene package; the implementation is incomplete but sufficient to demonstrate the approach and to enable testing." Jakarta Lucene is a Java-based, high-performance, full-featured text search engine. The LuceneXML package "provides a manager class (XMLSandRManager) that exposes factory methods for creating XML indexers and searchers. Using the XML indexer, you can add XML documents to a Lucene index. The XML searcher provides convenience methods for submitting XML queries to Lucene... The LuceneClient application lets you index XML documents and submit queries against Lucene indexes."
From the lucene_xml package description: "The indexing approach used is to index each element as a separate Lucene document. For [each] element, its directly-contained PCDATA content, its tagname, DOM tree location, ancestor list, and attributes are indexed. Each of these 'element docs' is related to the original XML document by the XML document's 'docid' (e.g., its fully-qualified filename, URL, or repository ID). The XMLIndexer class exposes one method: indexNewDocument(). This method takes the path of an XML document and attempts to index it. The document must be a valid XML document (e.g., you can open the document with IE5). In this implementation, the file path is used as the docid stored in the index. The method returns an IndexingMetrics object, which contains timing and data size information about the document indexed, including the Lucene-specific time, DOM-specific time, number of elements, total nodes processed, and total text content indexed..."
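The element-per-document approach quoted above can be illustrated with a minimal sketch. This is not the LuceneXML source and it does not call Lucene at all: it uses only the JDK's DOM parser and represents each "element doc" as a plain map of field names to values, roughly the set of fields one might then hand to a search engine. The class name ElementDocSketch, the method elementDocs(), the field names (docid, tagname, location, ancestors, content, attr.*), and the dotted location format are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical sketch: one field map ("element doc") per XML element. */
final class ElementDocSketch {

    /** Parses the XML text and returns one field map per element, in document order. */
    static List<Map<String, String>> elementDocs(String docId, String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<Map<String, String>> docs = new ArrayList<>();
        walk(dom.getDocumentElement(), docId, "", "1", docs);
        return docs;
    }

    private static void walk(Element el, String docId, String ancestors,
                             String location, List<Map<String, String>> out) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("docid", docId);            // ties the element back to its source document
        fields.put("tagname", el.getTagName());
        fields.put("location", location);      // assumed format: child positions, e.g. "1.2.1"
        fields.put("ancestors", ancestors);    // space-separated tag names, root first

        // Each attribute becomes its own field (assumed naming: "attr.<name>").
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            fields.put("attr." + attr.getNodeName(), attr.getNodeValue());
        }

        // Collect only directly contained text; descendant text belongs to the descendants' docs.
        StringBuilder text = new StringBuilder();
        List<Element> children = new ArrayList<>();
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            short type = n.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(n.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                children.add((Element) n);
            }
        }
        fields.put("content", text.toString().trim());
        out.add(fields);

        // Recurse, extending the ancestor list and the tree location for each child element.
        String childAncestors = ancestors.isEmpty()
                ? el.getTagName()
                : ancestors + " " + el.getTagName();
        for (int i = 0; i < children.size(); i++) {
            walk(children.get(i), docId, childAncestors, location + "." + (i + 1), out);
        }
    }
}
```

Calling ElementDocSketch.elementDocs("sample.xml", xmlText) on a document with four elements yields four field maps, each carrying the same docid. Keeping the ancestor list and tree location on every element doc is what would let a query restrict a text match to, say, a title nested inside a particular parent element.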
Principal references:
- XML Indexing With Lucene - LuceneXML package description
- Download sample code and interactive client. Includes source code, compiled Java classes, and dependent third-party packages.
- Apache Jakarta Lucene
- ISOGEN White Papers
- "XML and Query Languages" - Main reference page.