Indexing Structured Text for Queries on Containment Relationships


Tuong Dao
Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia.
tuong@kbs.citri.edu.au

Ron Sacks-Davis
Faculty of Applied Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia.
rsd@kbs.citri.edu.au

James A. Thom
Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia.
jat@cs.rmit.edu.au


Abstract

Documents consist of logical components such as titles and paragraphs. The complexity of the structure of documents is captured by document representation standards such as SGML. The GCL (Generalized Concordance Lists) query language has been proposed for collections of structured documents such as SGML documents. It uses containment relationships to provide a simple and effective way to formulate traditional boolean queries as well as queries specifying document structure and provides the flexibility to access, within the same database, documents which conform to multiple hierarchical structures and have different markup schemas. GCL also allows the retrieval of structural elements at any level of the document structure.

The flexibility allowed by the language and its implementation comes with a significant restriction: no recursive structures are allowed. However such structures are present in many SGML documents where components are defined recursively. This paper proposes to extend GCL to allow recursive structures. An implementation framework, based on an interval indexing scheme, is provided to demonstrate that only small extensions are required to support recursive structures.