An XML Information Retrieval Model - A Component based Approach.

THE INTEGRATION OF INFORMATION RETRIEVAL TECHNIQUES WITHIN A SOFTWARE REUSE ENVIRONMENT

Forbes Gibb, Colm McCartan, Ruairi O’Donnell, Niall Sweeney and Ruben Leon

Department of Information Science, Strathclyde University, 26 Richmond St., Glasgow, G1 1XH, UK

e-mail: forbes@dis.strath.ac.uk

Abstract

This paper describes the development of an information retrieval model for the indexing, storage and retrieval of documents created in the Extensible Mark-up Language (XML). The application area is the software re-use environment which involves a broader class of documents than can be processed by conventional IR systems. This includes design and analysis documents in Unified Modelling Language (UML) notation as well as textual format, source code, and textual and source code component interface definitions. XML was selected as it is emerging as the key standard for the representation of structured documents on the World Wide Web (WWW) and incorporates methods for the representation of meta-data. We describe a model that is easily customisable since it is based upon an extensible object-oriented framework. This allows the development of an Information Retrieval (IR) architecture that can be easily adapted to cope with the proliferation of XML Document Type Definitions (DTD) that is likely to be a characteristic of the WWW of the near future.

1. INTRODUCTION

The model described in this paper has been developed as part of AUTOSOFT, a European Union (EU) funded project (ESPRIT Project 25762) which is building a prototypical tool to provide support for high level software reuse based on automatic domain generation. The AUTOSOFT project has as its main objectives to:

Define a methodology specially oriented to high level reuse.
Create an application development set of reuse tools, which will help both development companies and end-users to semi-automatically create new applications by reusing high level software elements and defining effective Graphical User Interfaces (GUIs) with fourth generation languages (4GLs).
Test the developed application in an operational context by using it in at least two different domains and by two different development companies.

The project has brought together academic and industrial partners from four EU member states with a history of research in a variety of contributing fields such as domain generation, intelligent querying and information retrieval. The partners are:

Software AG Espana

CSI Piemonte

Universidad Carlos III de Madrid

Forschungszentrum Informationstechnik GmbH GMD

University of Strathclyde

This paper begins by highlighting the importance of software reuse and then describes the broad structure and function of the AUTOSOFT system. Some of the key technologies underpinning the project are then outlined with a particular focus on XML and its relevance to AUTOSOFT. This is followed by a discussion of the object-oriented framework that has been selected to support the development of the underlying information retrieval system, and then by some details of the implementation and the issues raised during the design and development phases.

2. SOFTWARE REUSE

The potential of software reuse has attracted considerable attention in recent years. Developers are faced with rising costs while needing to produce high-quality deliverables for projects of increasing complexity in the face of heightened customer expectations. Recent developments in software engineering practice, such as the use of advanced Computer Aided Software Engineering (CASE) tools and repositories, alongside the emergence of the object-oriented and related component paradigms, mean that it is now technologically and commercially possible to achieve high-level software reuse. By this we mean the reuse of highly encapsulated components, design objects, analysis experience, GUIs and pieces of applications. Software reuse may be focused on the exploitation of internally developed systems components or may make use of the emerging componentware market, where prices are often a fraction of the costs of homegrown development.

The two key benefits of software reuse are that:

Reuse of components that have already been tested provides higher guarantees of robustness and reliability in any future implementation.

Reuse of components should lead to faster development times and lower costs.

However, it is important to emphasise that the development of technologies to facilitate software reuse is only part of the solution. Opportunistic or ad hoc reuse, i.e. re-using software that just happens to be available in an organisation, is unlikely to deliver the cost and time savings that are desired [1]. Previous studies [2,3] have emphasised that there is a need to tackle the non-technical aspects of software reuse and adopt a holistic approach to software design and engineering. This implies that organisations must adopt a culture in which software is designed with its possible reuse in mind and with the intention that proprietorial attitudes over its authorship are eliminated. It is reported that experience with a range of software libraries has shown that "factors above 30% re-use are difficult to achieve with ad-hoc re-use alone" [4]. Simply creating repositories from legacy software, which has not been documented or designed for potential reuse, is therefore likely to be counter-productive.

There are many changes that organisations may have to implement to tackle these cultural and management issues. These include adopting shared terms, definitions, values and processes between the engineers and the customers of a system. Motivational issues such as reward systems for component reuse and transparency about job security will also need to be addressed. Davenport [5] highlights the problems that are created when new meanings for key business concepts, such as customer, proliferate. Organisations need to create a common information architecture that is built by key stakeholders and not just by information engineers. For instance Xerox experienced problems with traditional information engineering approaches when they set out to define common terms for key information elements. They then set up a Working Group of fifteen marketing and sales managers and their IT counterparts from their world-wide operations to develop an agreed definition of a customer on the basis of which eleven further customer-oriented terms were later developed.

AUTOSOFT recognises the need to adopt a reuse culture and has therefore proposed a technical solution that incorporates tools to develop shared models of the domain(s) for which a software component has been designed. The importance of domain engineering has been highlighted in a major benchmarking study of reuse in the United States [6] and is clearly one of the areas where information science can contribute to the development of effective reuse systems. A key characteristic of AUTOSOFT is that it incorporates a range of tools for indexing, classification and retrieval of the wide range of information sources that are encountered and created within a software development project.

More specifically, an indexing module is used to extract and normalise single and multi-word terms from English and Spanish textual documents and high-level pre-defined software project elements. The module supplies these terms as candidates for inclusion in a thesaurus, which is constructed for each domain, and also supports free text, weighted term retrieval. The use of a controlled vocabulary thesaurus and a free text retrieval system are viewed as complementary. There will be an inevitable lag between the identification of new concepts and their inclusion in the thesaurus. Free text retrieval will support queries based on non-controlled terms from any asset as soon as it is added to the repository, irrespective of whether the thesaurus has been updated. As noted below, indexing is also applied to a wider range of documents than are conventionally encountered in an IR environment and there is, for instance, a software indexing module which has the functionality to identify classes, attributes and hierarchies. Again this reflects the conclusions of the Department of Defense (DoD) study [6] which found that exemplar organisations reused, and therefore needed to index and retrieve, more than just software code.

It is envisaged that the AUTOSOFT tools will initially be employed to facilitate reuse of internally created components by the parent organisation. However the componentware market will become an equally important user of the type of tools being developed by AUTOSOFT. One probable scenario is that the componentware market will consist of many thousands of small, niche market companies in addition to the traditional mainstream software houses. Many of these will be virtual organisations who can exploit the potential of the WWW to support distributed development teams whose output is entirely digital and who no longer need to rely on the traditional distribution channels. The fragmented nature of the sector will lead to the emergence of component brokers and related intellectual property shops. Conventional search engines will not meet the needs of these players. Instead they will require search engines geared to the retrieval of software rather than those which focus primarily on web documents. They will also need domain models that can be browsed to assist end customers in identifying relevant components for their systems and applications.

There are a number of recently completed or ongoing projects that are investigating the issues associated with software reuse, some of whose goals overlap with those of AUTOSOFT (for an overview of European projects see [7] and for an evaluation of selected projects see [8]). These include: REBOOT (Reuse Based on Object-Oriented Techniques) [9,10], which led to the development of a comprehensive methodology for software reuse [11]; SALMS (Software Asset Library Management System) [12]; EUROWARE (Enabling Users to Reuse Over Wide Areas) [13]; and STARS (Software Technology for Adaptable Reliable Systems) [14,15,16].

3. AUTOSOFT AND SOFTWARE REUSE

3.1 Overview

AUTOSOFT has been designed to facilitate the reuse of software components by exploiting information generated at the design, analysis, modelling, and implementation phases of a software engineering project. Other approaches [17] recognise the richness of information contained in the documentation associated with such projects but AUTOSOFT goes further by drawing on as comprehensive a range of information sources as possible. It aims to achieve this by bringing together recent advances in the fields of domain analysis and automatic classification in conjunction with known and tested techniques in the area of information retrieval. The application of such IR techniques to storage and retrieval of objects as diverse as design descriptions, analysis models in Unified Modelling Language (UML) and source code is the key innovative feature of AUTOSOFT.

IR tools are also used to provide input to the domain generation modules within AUTOSOFT (for a fuller discussion of the principles behind this domain generation see [18,19]). Domain analysis is important for producing a framework of shared concepts that can help a software engineer identify suitable components for an application. A domain is a grouping of key abstractions that are common to a specific area of development. These may include components, objects, actors and the relationships between these. Domains are commonly used to analyse a business function or application area with a view to creating a domain model that will allow and encourage the reuse of components, sub-systems and knowledge. Thus, domain analysis might be applied to the marketing process of an organisation in order to identify common classes, task scenarios and activities that could be useful for constructing future marketing solutions. For software development specifically, a domain expert performing domain analysis may choose to gather source code, documentation, designs, manuals, test plans, requirements documents and documents containing domain-specific knowledge and use all of these assets to construct a domain model. AUTOSOFT is concerned with automating this process of domain construction as far as possible and enabling users to apply reuse at all stages in the software development process, from analysis and design through to construction of code. This is in contrast to most established techniques in software reuse, which focus largely on the reuse of low-level code.

It should be noted that these domain models will evolve over time and AUTOSOFT will therefore provide tools for maintaining as well as editing the models. This is particularly important given the emphasis placed on Business Process Re-engineering (BPR) and the need to support re-engineered systems [20,21]. The relationships between activities may need to be refined or altered following a BPR project and these changes will need to be reflected in the relevant business models and the underlying system models. BPR and software reuse are therefore closely linked. Increased costs and risk often go hand in hand with radical change, and effective reuse methodologies are therefore essential if these are to be kept within acceptable limits. Adopting OO design methodologies is one approach that can help to address these problems.

The OO paradigm differs from traditional techniques by basing software design on objects in the real world (e.g. a student) which consist of both data and the methods (i.e. procedures) that are used to process it. For instance, a student object might consist of data, such as personal information indicating whether the student is home-based or overseas, and methods to process this data, such as calculate fee. Objects communicate by using messages which call services from other objects, such as calculate outstanding fees. Classes of objects can be organised into hierarchies in which the classes share inherited characteristics. For instance, a student class could be implemented as an inheritance structure containing both undergraduate and postgraduate classes, each of which could inherit the method used to calculate fees from the parent class.
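The student example above can be sketched in a few lines of code. This is only an illustration: the class names follow the text, but the fee amounts and the rule that overseas students pay double are assumptions made for the example, and Python is used here for brevity rather than the project's own development languages.

```python
class Student:
    """A student object: data (personal details) plus the methods that process it."""

    BASE_FEE = 1000  # hypothetical annual fee, for illustration only

    def __init__(self, name, overseas=False):
        self.name = name
        self.overseas = overseas  # home-based or overseas student
        self.paid = 0

    def calculate_fee(self):
        # Assumed rule: overseas students pay double the base fee.
        return self.BASE_FEE * (2 if self.overseas else 1)

    def calculate_outstanding_fees(self):
        # A service that other objects request by sending a message (a method call).
        return self.calculate_fee() - self.paid


class Undergraduate(Student):
    """Inherits calculate_fee unchanged from the parent class."""


class Postgraduate(Student):
    """Also inherits calculate_fee; only the base fee differs (assumed value)."""

    BASE_FEE = 1500


home_ug = Undergraduate("A. Home")
overseas_pg = Postgraduate("B. Overseas", overseas=True)
overseas_pg.paid = 500
```

Neither subclass defines its own fee method, so both reuse the one written for the parent class, which is the kind of reuse through inheritance described above.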

The OO approach (for which AUTOSOFT is principally designed) is recognised to facilitate software reuse [22] as it removes the limitations imposed by monolithic, tightly coupled software development. AUTOSOFT should therefore also be suitable for supporting BPR initiatives as generic objects can be (re)-incorporated into code with greater ease than software written using non-OO approaches.

As a general design principle software developed within the AUTOSOFT project conforms to the guidelines presented in the Reusable Asset Methodology (RAM) [23] that was produced at an early stage of the project. This methodology is intended to support the development of reuse practice and - among other recommendations specific to the exploitation of the AUTOSOFT repository - supports many classic object oriented (OO) design principles such as designing for extensibility and high encapsulation.

The approach of treating many different asset types within AUTOSOFT implies a broader definition of document than is typical in IR. This may include documents with textual, modelling and programmatic information. Table 1 contains a simple taxonomy of these document types.

It is the authors’ contention that all of the assets described in Table 1 contain information that is useful in the processes of both domain analysis and software reuse. Software code, for example, is not only required for the direct application of code reuse but also may contain embedded domain information in the form of comments. However, as noted above, the effectiveness of the system will be in part dependent on an organisation adopting a reuse philosophy as comments and documentation will only be useful if they are written with reuse in mind.

Document category: Description

Analysis documents: Produced during the analysis phase of the development process, these will typically be specifications and requirements documents but may also include documents with embedded domain knowledge and analysis. This category of document is composed of textual documents and potentially documents drawn up in UML.

Design documents: Typically, documents concerning and describing the proposed or actual technical function of software. Again, both textual and UML documents may be present in this category.

Source code: This category represents all software code generated during the development process and, potentially, associated test cases. This includes textual source code with its embedded comments and interface descriptions for components.

Table 1: A simple taxonomy of documents treated by AUTOSOFT

Development and implementation of the AUTOSOFT system is currently being completed (see Figure 1 for the system architecture). The component architecture and object-oriented framework have led to the adoption of Java and C++ as the development languages. The target platform is Microsoft Windows NT 4.0 and therefore the Distributed Component Object Model (DCOM) has been chosen as the component model. The main modules in the AUTOSOFT system, which are distributed in a client-server architecture with DCOM components [24], are:

A Domain Generation System. This module is used to create the concept structure that is used to organise the high and low level components that will be stored in the repository for reuse. The domain generation is based on text documents, source code written in C++ and Java, and UML models. It is supported by a classification sub-system that uses a range of classification algorithms to identify and store the relationships that are to be incorporated in the thesaurus.

A Domain Maintenance System. This module allows an expert to edit and modify the domain by adding or removing terms and relationships.

An Indexing System. This is responsible for extracting, normalising, and relating terms from textual documents and high-level pre-defined software project elements. This module supports both single and multi-term indexing and is designed to work with English and Spanish source documents. It is used to generate candidate terms for inclusion in the thesaurus as well as to support free text, weighted term retrieval. A software indexing module is also being developed that has the functionality to identify classes, attributes and hierarchies.

A Components Reuse Tool. This module allows a software engineer to retrieve components from the repository and create a new software project with them. A software engineer will be able to create a new software project by querying the repository database, selecting high level components from the repository, linking the high-level components with low-level code components, and presenting the models to the user.

The AUTOSOFT architecture, shown in Figure 1, incorporates a set of distributed components collaborating to provide an integrated reuse system. Briefly, the different types of document shown in Table 1 are supplied to the indexing system and then stored in a repository. In parallel with this process a domain model is semi-automatically generated from these source documents under the supervision of a domain expert. This domain is expressed in a thesaural structure and, where possible, source documents are assigned classifications in this domain by the referencing and classification subsystems. Thesaurus-based and free text representations are viewed as complementary because there will be an inevitable lag between the identification of new concepts and their inclusion in the thesaurus, while free text retrieval will support queries based on non-controlled terms.

The indexing process extracts semantic, domain-related information from the various documents and puts it to several uses. For textual sources the contents are parsed into tokens or terms and stored in a retrieval-oriented index. At the same time, these terms may, under certain criteria, be passed as candidates for inclusion in the thesaurus. Software and diagrammatic sources have all their internal text and comments extracted and treated in the same fashion. Some structural information is also extracted from source code files in the form of classes, methods, and attributes, and their inheritance hierarchy. Classes refer to domain objects (e.g. a bank account), methods to an operation on that object (e.g. make a deposit), and attributes to data related to an object (e.g. a balance). The inheritance hierarchy provides information on the relationships that exist between objects (e.g. that the object bank account can have two specialisations: a deposit account and a current account). Such relationships have clear importance for establishing thesaural relations within a domain. Again, these data are stored in a retrieval index and may be passed as sets of candidate terms and relationships for inclusion in the thesaurus. Classification of the asset occurs after this information extraction process. Searches may therefore be performed upon the repository by using free text, weighted term retrieval or via the thesaurus.
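The flow described above can be sketched as follows. This is a minimal illustration, not the AUTOSOFT implementation: the stopword list, the use of raw term frequency as the term weight, the `BT` (broader term) label and all function names are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}  # tiny illustrative list


def tokenise(text):
    """Lower-case the text and split it into alphabetic terms, dropping stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_index(assets):
    """Build a retrieval-oriented index: term -> {asset id: term frequency}."""
    index = defaultdict(dict)
    for asset_id, text in assets.items():
        for term, freq in Counter(tokenise(text)).items():
            index[term][asset_id] = freq
    return index


def search(index, query):
    """Rank assets by the summed frequency of the query terms (assumed weighting)."""
    scores = Counter()
    for term in tokenise(query):
        for asset_id, freq in index.get(term, {}).items():
            scores[asset_id] += freq
    return [asset_id for asset_id, _ in scores.most_common()]


def thesaurus_candidates(hierarchy):
    """Turn (subclass, superclass) pairs into candidate broader-term relations."""
    return [(sub, "BT", sup) for sub, sup in hierarchy]


assets = {
    "design.txt": "The bank account supports a deposit and a withdrawal.",
    "Account.java": "// bank account: make a deposit, check the balance of the account",
}
index = build_index(assets)
hits = search(index, "account balance")
relations = thesaurus_candidates(
    [("DepositAccount", "BankAccount"), ("CurrentAccount", "BankAccount")]
)
```

Text and source code comments pass through the same tokeniser, while the inheritance pairs are kept separately as candidate relations that a domain expert could accept or reject for the thesaurus.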

Central to this architecture is the repository of reusable assets (i.e. the document collection) that must have their information content extracted for indexing and thesaural classification. As shown in Table 1, the repository may consist of:

Design and analysis documents in a textual format

Design and analysis diagrams in UML notation

Source code, potentially in several different object-oriented 4GLs with embedded textual information.

Textual and source code component interface definitions.

The IR system is therefore required to store and retrieve structured documents in a format flexible enough to represent the diversity of document types shown above and, ideally, expressive enough to provide a representation of source code. Whether implicit or explicit, the structure of these documents varies from relatively low, in the case of textual documents, to high, in the case of source code. The heterogeneous nature of such a collection implies that the indexing and retrieval processes should be independent of the document type. This will allow the development of a system that is as generic as possible while allowing the easy integration of new media and document types in the AUTOSOFT repository. These considerations have led to the adoption of XML as a format for presenting document surrogates to the indexing process. As discussed below, XML provides a mechanism for reducing many diverse types of document to a structured representation expressed in a mark-up language.
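The idea of presenting every asset to the indexer as an XML surrogate can be illustrated with a short sketch. The element and attribute names (`asset`, `title`, `content`, `id`, `type`) are hypothetical and are not taken from the AUTOSOFT DTD, which is not reproduced in this paper.

```python
import xml.etree.ElementTree as ET


def make_surrogate(asset_id, asset_type, title, text):
    """Wrap an asset's indexable content in a small XML surrogate.

    Element and attribute names are hypothetical, not the AUTOSOFT DTD.
    """
    root = ET.Element("asset", {"id": asset_id, "type": asset_type})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "content").text = text
    return root


def indexable_text(surrogate):
    """Collect all text in the surrogate; the indexer never sees the original format."""
    return " ".join(t.strip() for t in surrogate.itertext() if t.strip())


surrogates = [
    make_surrogate("a1", "analysis", "Requirements", "Customers open accounts."),
    make_surrogate("a2", "source", "Account.java", "class Account { /* holds a balance */ }"),
]
texts = [indexable_text(s) for s in surrogates]
serialised = ET.tostring(surrogates[0], encoding="unicode")
```

Because `indexable_text` works on the surrogate rather than on the original file, adding a new document type only requires a new converter to the surrogate format, not a change to the indexer.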

4. STANDARDS AND AUTOSOFT

AUTOSOFT has made extensive use of established and emerging standards as part of its development environment. This will ensure as open and flexible a technology base as possible although it is recognised that not all the standards are fully ratified or fully developed. There has been a particular emphasis on standards that will facilitate information exchange across the Web. The role of each of the key standards is discussed below.

4.1 XML

XML has generated considerable attention in the technical press recently, where it has been widely referred to as the replacement for HyperText Mark-up Language (HTML). The two languages are, of course, related and share a common parent in the Standard Generalized Markup Language (SGML) [25]. SGML is a complex standard in the form of a meta-language which allows an author or publisher to describe how a document is structured by means of a Document Type Definition (DTD). A DTD specifies which tags a document may have, what they are composed of, and how they are related to one another within the structure of the document in terms of sequence, nesting, etc. These tags structure text into headings, paragraphs, lists, hypertext links etc. For instance, in Figure 2 the tags <HTML> and </HTML> indicate the start and end of an HTML document, <HEAD> and </HEAD> the start and end of the head of an HTML document, <UL> and </UL> the start and end of an unordered bulleted list, and <LI> a list item.

<HTML>
<HEAD>
<TITLE> Course 475: Software List </TITLE>
</HEAD>
<BODY>
<CENTER>
<H1> Course 475: Software we will use </H1>
</CENTER>
<UL>
<LI> Microsoft NT
<LI> Netscape Navigator Gold
<LI> Microsoft Internet Explorer
<LI> Office 97
<LI> Cold fusion
<LI> Front Page 97
<LI> Other…
</UL>
</BODY>
</HTML>

Figure 2. An HTML Document Showing HTML Tags

HTML was originally a DTD of SGML but was later extended to include tags that defined both structural (i.e. the elements contained in a document such as Head, body, title and paragraph) and presentational features (i.e. how these elements appear, such as bold, italic, and colour). Awareness of the limitations of HTML and the desire to avoid the proliferation of proprietary HTML extensions led to the ratification of XML by the World Wide Web Consortium [26] in February, 1998. In terms of complexity, XML lies somewhere between SGML and HTML and is aimed at providing most of the richness of the SGML command set while remaining easy to learn, implement and use. It has been described as a dialect of SGML and is itself a meta-language.

Probably the most important departure from HTML is that XML specifies the structure of a document via its accompanying DTD and is robust enough to describe a wide range of abstract structures. Since the author of the document can define this DTD, any customised tags may be defined and used within the document. XML can therefore be used to describe data objects, structured records, and many other types of structured data. The industry has been quick to appreciate the potential of this new standard for exchanging and sharing documents on the Internet and intranets, and also for sharing a huge variety of structured data using established Internet standards such as HyperText Transfer Protocol (HTTP) as the mediating protocols.
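As a small illustration of author-defined tags, the sketch below parses a document whose DTD is declared inline and turns it into a structured record. The `component` vocabulary is invented for this example, and Python's standard `xml.etree.ElementTree` parser reads the DTD declaration but does not validate the document against it.

```python
import xml.etree.ElementTree as ET

# A document with its own (hypothetical) DTD declared in the internal subset.
DOC = """<?xml version="1.0"?>
<!DOCTYPE component [
  <!ELEMENT component (name, language, description)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT language (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>
<component>
  <name>FeeCalculator</name>
  <language>Java</language>
  <description>Calculates outstanding student fees.</description>
</component>"""

root = ET.fromstring(DOC)

# Each custom tag becomes a named field of a structured record.
record = {child.tag: child.text for child in root}
```

The tags carry meaning chosen by the author (`name`, `language`, `description`) rather than presentation, which is what allows the same document to be treated as a structured record.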

The IR community has recently focused on the problems associated with collections of structured documents and the merging of information and data retrieval techniques. These range from the problems of interoperability [27] to issues to do with the adoption of relational database technologies [28,29]. In most cases [30,31,32,33] it has been necessary to exploit a proprietary mark-up system with the attendant costs of document preparation and the problems typically associated with sharing documents between formats. More successful approaches have focused on SGML and have suggested that the retrieval of documents with explicit structure is a feasible and useful goal. However, the complexity of SGML has often been cited as a disadvantage and XML seems to hold out the promise of a simplified and more usable document model.

XML offers a potentially industry-wide, customisable format with the robustness and flexibility to model a huge range of document types. It seems safe to suppose that support for XML in the form of toolsets and application suites will be forthcoming from an industry increasingly accustomed to, and demanding of, interoperability between vendors’ products [34].

4.2 AUTOSOFT and XML Meta-data Interchange (XMI)

Although authors may specify their own tags in XML by creating their own DTD, this is a non-trivial task and there is much to be gained by having standardised, public DTDs as exemplified by HTML. XMI was developed as a response to the Object Management Group (OMG) request for an XML-compliant data transfer format [35] that would allow users to exchange whole or partial object models and other meta-models. XMI attempts to integrate XML, UML and the Meta Object Facility (MOF). UML is the OMG object and business modelling standard while MOF is their standard for describing meta-data and repository content. Among the proposed standards under the XMI umbrella, there is a full DTD allowing UML documents to be exchanged via XML. In addition, most UML modelling tools support the saving of their UML models as XML documents.

This language and platform-independent format has obvious benefits to the AUTOSOFT project, given the varied nature of collections that were described previously. To complete the broad picture of the system’s functionality, a document in the AUTOSOFT repository is mapped onto an XML-compliant document that will then be indexed by the indexing and retrieval system. XMI is judged to be ideal for the representation of source code as it provides support for all UML diagrams through its UML DTD. A DTD has been designed within AUTOSOFT for representing information about indexable assets. Note that the system does not require that users adopt XML in the preparation of their documents since this mapping will be a pre-processing phase for the document collection. Not only are modules for the conversion of several object-oriented 4GLs under development, but XML is being supported by the major software vendors. For instance, Oracle has released 8i, an XML-compliant DBMS, while IBM and Rational Software (who also developed UML - see below) are using XMI to bridge between IBM's VisualAge for Java and Rational Rose, a software modelling package. This raises the possibility that the user could simply export their designs directly into XMI for treatment by AUTOSOFT.

The incorporation of XML into the AUTOSOFT approach, along with the typical challenges associated with developing complex software systems has led to the adoption of an object-oriented development framework, designed for the field of IR, which is discussed below in Section 5.

4.3 Unified Modelling Language (UML)

Models are used routinely by software engineers to provide a systematic description of the requirements of a piece of software. These models help them to write efficient and reliable code but also play an important role in communicating the complex relationships between the components of an information system. As models exploit graphical languages for representing these relationships they are accessible to end-users as well as engineers. They can therefore be used to prototype and refine systems based on improved and more detailed understanding of the needs of users. Various methods for modelling systems have been developed over the years, each of which has its own notation, conventions and tools. UML is an attempt to put an end to what Eriksson and Penker refer to as the "method wars" [36] and draws on the pioneering work of Booch [37], Jacobson [38] and Rumbaugh [39] in the domain of OO software engineering. UML consists of four main elements [36]:

Views: It is usually impossible to describe a complex system in a single, understandable graph. UML therefore uses views (high-level abstractions) to show the different aspects of the system that are being modelled. Each abstraction consists of a number of diagrams.

Diagrams: Diagrams are the graphs which show how model elements are arranged, related, etc., as part of the system.

Model elements: Model elements represent OO concepts such as classes, objects and messages, and the relationships between these concepts.

General mechanisms: General mechanisms provide additional information about a model element.

UML is designed for modelling of systems, software and businesses and therefore contains both high-level and low-level analyses of how systems operate and co-operate. Models are therefore regarded within AUTOSOFT as an important source of graphical and textual information regarding the functionality, look, performance, etc., of a piece of software which can be exploited for the purposes of domain generation and retrieval of relevant software components.

The interdependency of these standards is perhaps difficult to visualise and a simplified example is used below to indicate how it is possible to transform a software model expressed in UML into what is essentially a tagged document. Figure 3 shows the UML notation used in this example and later to discuss the FIRE framework.

Figure 4 uses UML to show two classes associated with one another via a generalisation or "is-a" relation. Some class attributes and a method or "operation" are indicated.

Transformation of this model to XMI should result in no information loss, allowing the model to be freely interchanged with other parties or, more importantly for us, allowing the tagged representation of the model to be analysed, indexed and retrieved. Figure 5 shows a simplified expression of the model in XMI. Some headers and tag prefixes have been removed to increase readability.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE XMI SYSTEM "uml.dtd">

<XMI>

<Class xmi.id="_1.2">
   <Name>Car</Name>
   <Feature>
     <Attribute>
       <Name>color</Name>
       <Visibility xmi.value="private"/>
       <OwnerScope xmi.value="instance"/>
     </Attribute>
     <Attribute>
       <Name>engineSize</Name>
       <Visibility xmi.value="private"/>
       <OwnerScope xmi.value="instance"/>
     </Attribute>
   </Feature>
</Class>

<Class xmi.id="_1.3">
   <Name>BMW5</Name>
   <Feature>
     <Operation>
       <Name>start</Name>
       <Visibility xmi.value="public"/>
       <OwnerScope xmi.value="instance"/>
     </Operation>
   </Feature>
</Class>

</XMI>

Figure 5. Simplified XMI representation of class model.

It is not important to fully understand the details of Figure 5 but it can be seen that, after a normal XML file header which tells the XML processor that the UML DTD is being used, the file contains the descriptions of the two classes and the relationship between them. This hopefully gives an idea of the power of XML to communicate arbitrary structures once a suitable metalanguage such as the UML DTD has been agreed.
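To illustrate how such a tagged representation can be analysed programmatically, the following minimal sketch uses Python's standard xml.etree.ElementTree module to list the classes in a fragment shaped like Figure 5, together with their attributes and operations. It is not part of the AUTOSOFT implementation: the enclosing XMI root element, the shortened fragment and the function name summarise_classes are assumptions made for illustration, and real XMI files carry additional headers and tag prefixes.

```python
import xml.etree.ElementTree as ET

# A shortened fragment shaped like Figure 5; the <XMI> root is assumed here
# so that the text is well-formed XML.
XMI_TEXT = """
<XMI>
  <Class xmi.id="_1.2">
    <Name>Car</Name>
    <Feature>
      <Attribute>
        <Name>color</Name>
        <Visibility xmi.value="private"/>
      </Attribute>
      <Attribute>
        <Name>engineSize</Name>
        <Visibility xmi.value="private"/>
      </Attribute>
    </Feature>
  </Class>
  <Class xmi.id="_1.3">
    <Name>BMW5</Name>
    <Feature>
      <Operation>
        <Name>start</Name>
        <Visibility xmi.value="public"/>
      </Operation>
    </Feature>
  </Class>
</XMI>
"""


def summarise_classes(xmi_text):
    """Return {class name: [(member kind, member name, visibility), ...]}."""
    root = ET.fromstring(xmi_text)
    summary = {}
    for cls in root.iter("Class"):
        members = []
        # Every child of <Feature> is an attribute or an operation.
        for member in cls.findall("Feature/*"):
            visibility = member.find("Visibility").get("xmi.value")
            members.append((member.tag, member.findtext("Name"), visibility))
        summary[cls.findtext("Name")] = members
    return summary


SUMMARY = summarise_classes(XMI_TEXT)
```

Each tuple in SUMMARY corresponds to one tagged model element, which is the kind of unit that can subsequently be indexed and retrieved.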

5. THE FIRE FRAMEWORK

5.1 Overview

The FIRE (Framework for Information Retrieval) framework is an object oriented information retrieval programming model, originally developed at UBILAB Switzerland in co-operation with Robert Gordon University, Aberdeen. The FIRE framework was developed with two aims in mind: firstly as a general development model for all types of IR systems and secondly to support the experimental evaluation of indexing and retrieval techniques in those systems. A comprehensive description of FIRE is given in Sonnenberger and Frei [40]. A brief description of the FIRE model and its customisation within the AUTOSOFT project is provided below together with a discussion of the advantages of the model both for generic IR development and for our specific needs.

When we state that FIRE is an IR Framework, by framework we mean "a collection of classes that provide a set of services for a particular domain; a framework thus exports a number of individual classes and mechanisms that clients can use or adapt" [37,41]. From a developer’s point of view a framework can be viewed as a programming skeleton that defines the basic concepts of the application domain and how the components in the domain are interrelated. Figures 6-8 depict some of the structures in the FIRE model and are represented in the UML notation that is explained in Figure 3. The classes in Figures 6-8 are simplified representations of the elements in the FIRE model and should only be used as an aid to understanding the model. A comprehensive overview of the framework’s classes including attributes, operations and hierarchies is given in works by Sonnenberger and Frei [40, 41, 42].

The design of FIRE uses an OO model and as such should be implemented in an OO language. The objects in FIRE represent basic IR concepts, their functionality and the interactions between them, e.g. between an index and the retrieval functionality. In this overview we aim to describe the framework, rather than provide an enumeration of the class hierarchies or detailed instructions for its use.

In FIRE each indexable element is represented as a document. This can correspond to a real world document or its surrogate; for example, a web page. Each document consists of one or more document features (simply called features in this paper). These correspond to either structured or unstructured parts of the document and may be in the form of different media (text, pictures, etc.). The FIRE representation of the document is created from the real document by parsing it and associating elements of the document with the corresponding FIRE elements. This can be programmed behaviour (that is, a different parser is required for each document type) or the document itself can provide information about the features it contains using some form of document mark-up; for example XML.

In Figure 6 JournalPaper is an example of a FIRE document. In this case the type of document is a journal paper which contains a title, an author, the author’s picture, an abstract, the main text, and a set of references. Here the title, author, abstract, main text and references will all be treated as separate sections in the document (for indexing and retrieval purposes) while the author’s picture, although a separate feature, will be linked to and hence retrieved as part of the Author feature. FIRE allows each feature in the document to be treated as a separate section or as a convenience to aid document indexing. In OO terminology JournalPaper is a concrete subclass of the FIRE document class. This document class provides methods to allow each of its features to be accessed.

Each feature can contain more features; that is, it can be a set of features. Each feature that is not a set will contain one or more data elements that contain the actual data associated with that feature. Data elements for each programmatic type of data are defined, that is, data elements for each type of data in the application e.g. strings, integers, images, etc. For example, the "Author" feature could contain a string data element holding the author’s name, while the "Abstract" feature could contain a set of string data elements that are all the words in the abstract. Figure 7 shows how these features are defined.
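The document, feature and data element structure described above can be sketched as a small composite. The sketch below is in Python and is only an illustration under our own assumptions: the class and method names (DataElement, Feature, Document, is_set, values, feature) and the placeholder author name are hypothetical and are not taken from the FIRE class hierarchy.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataElement:
    """Holds one value of a programmatic type (string, integer, image, ...)."""
    value: object


@dataclass
class Feature:
    """A named part of a document: data elements and/or nested features."""
    name: str
    children: List[Union["Feature", DataElement]] = field(default_factory=list)

    def is_set(self):
        # A feature set contains further features rather than only raw data.
        return any(isinstance(child, Feature) for child in self.children)

    def values(self):
        # Flatten the data held by this feature and by any nested features.
        collected = []
        for child in self.children:
            if isinstance(child, Feature):
                collected.extend(child.values())
            else:
                collected.append(child.value)
        return collected


@dataclass
class Document:
    """An indexable element made up of one or more features."""
    features: List[Feature] = field(default_factory=list)

    def feature(self, name):
        # Give access to a top-level feature by name.
        for candidate in self.features:
            if candidate.name == name:
                return candidate
        raise KeyError(name)


class JournalPaper(Document):
    """A concrete document type, loosely following the journal paper example."""


paper = JournalPaper(features=[
    Feature("Author", [DataElement("A. N. Author")]),
    Feature("Abstract", [DataElement(w) for w in "a reusable IR framework".split()]),
])
```

Here the "Author" feature holds a single string data element while the "Abstract" feature holds one string data element per word, mirroring the example in the text.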

Each feature is indexed according to indexing modalities specified by the user (as opposed to the developer), which are stored in a set of indexing parameters. Each indexing parameter states which index feature (the data to be stored in the index) is to be used for the document feature data, and in which index this index feature will be stored. Creating the index feature from the document feature value transforms the data into a form that can be stored in the index. Figure 8 shows the relationship between these features.

The example in Figure 8 shows how a document’s author is to be stored in the AuthorIndex as the author’s name but in upper case. Different indexing features could be used to index the document feature in different ways. For example a Soundex [43] indexing feature could be used to index the authors’ names to allow phonetic types of queries for the author, while the index represented by AuthorIndex could be a plain text file. Note that the classes in the example have been greatly simplified to aid understanding, for example, details of the operations of the classes have been removed.
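The upper-case author example can be sketched as follows. This is a minimal Python illustration, assuming that an indexing parameter can be reduced to a triple of document feature name, transformation function and target index name; the class IndexingParameter, the function index_document, the in-memory index structure and the second index "AuthorLowerIndex" are all hypothetical and serve only to show one feature being indexed in more than one way.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class IndexingParameter:
    """States how one document feature becomes entries in one index."""
    document_feature: str                 # e.g. "Author"
    index_feature: Callable[[str], str]   # transformation applied before storage
    index_name: str                       # the index that receives the result


def index_document(doc_id, features, parameters, indexes):
    """Apply every parameter; the same feature may feed several indexes."""
    for param in parameters:
        for value in features.get(param.document_feature, []):
            entry = param.index_feature(value)
            indexes[param.index_name][entry].add(doc_id)


# Indexes are modelled in memory as: index name -> entry -> set of document ids.
indexes = defaultdict(lambda: defaultdict(set))

parameters = [
    IndexingParameter("Author", str.upper, "AuthorIndex"),
    IndexingParameter("Author", str.lower, "AuthorLowerIndex"),
]

index_document("doc-1", {"Author": ["Frei"]}, parameters, indexes)
```

Because the parameters are ordinary data, a user could add or replace an entry in the list without any change to index_document.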

Using this indexing functionality, a system based on FIRE can index each of the documents by indexing each of its features. The FIRE framework does not impose any particular indexing sequence on the developer thus allowing features to be indexed in parallel and into more than one index if required.

Retrieval in FIRE is handled by a method that is best described as the reverse of the indexing method. For a given query the parts of the query corresponding to the different features of the indexed document need to be identified. Some method of identifying these must be implemented. The authors of FIRE suggest that the developer should use a template that can be filled in by a user [40]. A set of retrieval modalities identifies how to search for each of these query parts (or features).

The query is similar to an incomplete document in FIRE. That is, it can contain one or more of the same features as the original document. The retrieval modalities specified by the user define how each query feature is to be located in the index or indexes. They contain information about which index should be searched for each query feature and how the query features are transformed to corresponding index entries. Index entries are the values in the index of the indexed features from a document. For example, a stemmed word could be the index entry corresponding to a word from a textual part of the document (e.g. the BodyText feature).

The retrieval modality also specifies the matching method to be used for the retrieval. In FIRE the matching method is defined in classes of MatchingParameters. These specify how matching is to be performed between IndexEntries (from the index) and the query features from the query. For example, in the author example given above, the matching methods used could be a case insensitive match and a Soundex match. These MatchingParameters also define how to assess how close a match is, that is, they assign a numeric score to the closeness of the match.

The retrieval modalities in FIRE allow more than one matching method to be used for any query/document feature combination and for more than one index to be searched for possible matches. Again, as with indexing, FIRE does not impose a retrieval process on the developer as it is the system user who can specify how query features are to be matched against indexed assets.
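A minimal Python sketch of this retrieval scheme is given below, assuming that a matching method can be modelled as a function returning a numeric score for a query value and an index entry. The function names, the modality table, the score of 0.5 for a phonetic match and the sample index contents are our own illustrative assumptions; the soundex function is a compact version of the common American Soundex rules and is not taken from [43].

```python
def soundex(name):
    """Return a four-character Soundex code, or "" if name has no letters."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = codes.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code
    return (result + "000")[:4]


def case_insensitive_match(query_value, index_entry):
    return 1.0 if query_value.lower() == index_entry.lower() else 0.0


def soundex_match(query_value, index_entry):
    # Scored below an exact match; the 0.5 weight is an arbitrary choice.
    code = soundex(query_value)
    return 0.5 if code and code == soundex(index_entry) else 0.0


# Retrieval modality: query feature -> (index to search, matching methods).
RETRIEVAL_MODALITIES = {
    "Author": ("AuthorIndex", [case_insensitive_match, soundex_match]),
}


def retrieve(query, indexes, modalities):
    """Score documents for a query given as {feature name: value}."""
    scores = {}
    for feature, value in query.items():
        index_name, matchers = modalities[feature]
        for entry, doc_ids in indexes[index_name].items():
            score = max(matcher(value, entry) for matcher in matchers)
            if score > 0:
                for doc_id in doc_ids:
                    scores[doc_id] = scores.get(doc_id, 0.0) + score
    # Best score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


INDEXES = {"AuthorIndex": {"ROBERT": {"doc-1"}, "RUPERT": {"doc-2"}, "GIBB": {"doc-3"}}}
RESULTS = retrieve({"Author": "robert"}, INDEXES, RETRIEVAL_MODALITIES)
```

In this example the query "robert" matches the entry "ROBERT" exactly (ignoring case) and the entry "RUPERT" phonetically, so both documents are returned with different scores.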

FIRE is designed such that the notions of extension and customisation are inherent in the structure of the framework. Indeed, a certain level of customisation is always necessary since FIRE is simply a framework that requires a specific instantiation for a given application (document type). These design considerations provide a powerful and flexible basis for development.

5.2 AUTOSOFT and the FIRE framework

FIRE is a robust model for IR applications and as such can be regarded as at least comparable with other models for a general IR system. Since this design has never been fully implemented, it is of interest to the authors to test and evaluate FIRE as part of its utilisation within the project. However, our reasons for selecting FIRE were also closely related to the requirements of the AUTOSOFT project and the other development strategies selected in the project.

One of the requirements of the AUTOSOFT project is to develop a set of distributed, collaborating components. This component approach applies at the macro level of the project development and to individual parts of the indexing and retrieval engines. Specifically, the indexing engine needs to allow different algorithms to be used to construct entries in the textual document index. This corresponds, for the most part, to allowing different word stemmers to be written and slotted in to the system without too much effort. By this we mean that the rest of the system must remain unaffected by the change and should not need recompilation. A component approach fulfils this design criterion and the FIRE framework is ideally suited to this for two reasons.

Firstly, being object oriented it allows interfaces between each part to be rigidly defined so that new parts can be added without fear of affecting the functioning of other parts as long as that new part conforms to the interface of the part it replaces. Of course, this can be true of all systems developed with OO technology.

Secondly, FIRE forms links between functional parts by means of the indexing and retrieval modalities. These effectively create soft links between the parts of the FIRE framework concerned with: representing the document contents; storing those contents in an index; the method for transforming the contents to the form that is to be indexed; and the matching to be used to retrieve the document. By soft links, we mean that at application construction the exact method for indexing any given document feature or retrieving one of its features need not be defined. Depending on the programming method used, these soft links can just be functional or can extend to the run-time environment where the application is unaware of what parts (classes) it requires until they are specified with an indexing or retrieval modality. It is these soft links that allow a key requirement of AUTOSOFT to be fulfilled, namely allowing the behaviour of the feature indexing and matching to be changed after the system has been constructed without reference to any existing indexing or matching functionality in the system.
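One way to picture such a soft link is a registry in which components are looked up by name at run time. The Python sketch below is an assumption-laden illustration rather than the AUTOSOFT mechanism: the registry STEMMERS, the decorator register_stemmer, the modality dictionary and both toy stemmers are hypothetical, and the "strip-s" stemmer is deliberately naive.

```python
# Registry of stemmer components, filled in as components are loaded.
STEMMERS = {}


def register_stemmer(name):
    """Decorator that makes a stemmer available under a symbolic name."""
    def decorator(func):
        STEMMERS[name] = func
        return func
    return decorator


@register_stemmer("identity")
def identity_stemmer(word):
    return word


@register_stemmer("strip-s")
def strip_s_stemmer(word):
    # Deliberately naive: drops one trailing "s"; not a real stemming algorithm.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def index_terms(words, modality):
    """Resolve the stemmer named in the modality only when indexing runs."""
    stem = STEMMERS[modality["stemmer"]]
    return [stem(word.lower()) for word in words]


PLAIN = index_terms(["Classes", "Models"], {"stemmer": "identity"})
STEMMED = index_terms(["Classes", "Models"], {"stemmer": "strip-s"})
```

The function index_terms never refers to a particular stemmer, so a new one can be registered and selected through the modality without touching the indexing code.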

Another requirement is created by the choice of XML as our document representation language. Because of this, a model is preferred that ties closely with the document structure of XML for ease of development and for a consistent modelling approach to be taken in the IR system. In examining FIRE one can see how XML elements and FIRE document features play a very similar role in the overall document structure. By allowing a user to specify the mapping between the XML elements and FIRE document features the system can index any XML document, provided that a FIRE document feature exists that can represent the information contained in that XML element.

6. ISSUES AND CONCLUSIONS

The building of this prototype raises many interesting issues concerning both design and implementation. FIRE is a flexible and powerful framework designed for general application but this could be said to be both its strength and weakness. The layers of abstraction that it incorporates and that give it its flexibility can lead to efficiency problems. The framework has had to be extended and modified to fit the requirements of AUTOSOFT. In particular the relationships between the classes have been simplified and matching definitions have been altered to improve retrieval performance. Ongoing work therefore focuses on evaluating the FIRE framework in two specific areas, namely, scalability and effectiveness.

An interesting aspect of the implementation is the choice of storage technology. Although Object-Oriented Database Management Systems (OODBMS) are maturing, early systems were perceived as having performance and scaling problems. This consideration, in combination with a project-specific requirement to integrate with legacy database systems, led to the selection of a relational DBMS within AUTOSOFT. A project constraint was therefore that the permanent store for indexed information would be a relational DBMS (RDBMS) rather than an OODBMS.

Prototype implementations of the FIRE framework used ObjectStore as the underlying OODBMS to store the classes representing the document contents directly [44]. Having decided to use an RDBMS we had to choose between:

creating a direct representation of the objects in the database structure (effectively storing a representation of the document), matching the prototype implementation as closely as possible;

or storing only the minimum information required for each document to allow retrieval to take place.

We chose the latter option as, although this brings the loss of some flexibility because the mapping between objects in the system and the permanent store is quite rigid, there is also a projected increase in efficiency. This is because the amount and form of data stored is specifically tailored to the efficient retrieval of documents matching a query and does not try to model the document in the store. It should be noted that the flexibility of the FIRE framework provides scope for us to change the underlying store to be something other than an RDBMS in the future.
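As an indication of what storing only the minimum information might look like, the following Python sketch keeps a single relational table of index entries using the standard sqlite3 module. The table name posting, its three columns and the helper functions are assumptions made for illustration; the actual AUTOSOFT schema is not described in this paper.

```python
import sqlite3

# An in-memory database stands in for the project's RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posting ("
    " index_name TEXT NOT NULL,"
    " entry TEXT NOT NULL,"
    " doc_id TEXT NOT NULL,"
    " PRIMARY KEY (index_name, entry, doc_id))"
)


def store_entry(index_name, entry, doc_id):
    """Record that a document has the given entry in the given index."""
    conn.execute(
        "INSERT OR IGNORE INTO posting VALUES (?, ?, ?)",
        (index_name, entry, doc_id),
    )


def lookup(index_name, entry):
    """Return the ids of all documents stored under an index entry."""
    rows = conn.execute(
        "SELECT doc_id FROM posting"
        " WHERE index_name = ? AND entry = ? ORDER BY doc_id",
        (index_name, entry),
    )
    return [row[0] for row in rows]


store_entry("AuthorIndex", "FREI", "doc-1")
store_entry("AuthorIndex", "FREI", "doc-2")
store_entry("AuthorIndex", "FREI", "doc-2")  # duplicate, ignored by the key
store_entry("BodyTextIndex", "retriev", "doc-1")
```

Only the index entries and document identifiers are held; the document itself is not modelled in the store, which is the trade-off between flexibility and efficiency discussed above.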

The original FIRE framework was targeted at users who would develop classes to represent the content and structure of each type of document to be stored [48]. We envisaged this as being too limiting for XML documents (whether web-based or not) as their structure can change quickly and there is an infinite number of ways of defining the structure of a document for any given application. Instead of trying to coerce a diverse set of XML documents to fit into a limited set of application-specific document definitions we changed the framework to allow any structure of document to be represented. This is done by allowing users to specify at run-time the mapping between XML elements and the document features of the document. The XML parser then performs this mapping and creates documents with the document features representing the XML elements. As long as the user has defined how to index all of these features the system can index the document. This flexibility will allow any type of XML document to be indexed in the future without requiring any programmatic changes in the system so long as the data type of the XML element can be indexed into the permanent store.
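The run-time mapping described above can be sketched in a few lines of Python. The mapping table, the feature names "Identifier" and "BodyText", the sample XML and the function build_features are hypothetical; the sketch only shows that the parser needs no knowledge of a particular document type once the mapping is supplied as data.

```python
import xml.etree.ElementTree as ET


def build_features(xml_text, element_to_feature):
    """Create {feature name: [values]} from XML using a user-supplied mapping."""
    root = ET.fromstring(xml_text)
    features = {}
    for element in root.iter():
        feature_name = element_to_feature.get(element.tag)
        text = (element.text or "").strip()
        # Elements without a mapping, or without text, are simply skipped.
        if feature_name and text:
            features.setdefault(feature_name, []).append(text)
    return features


# Supplied by the user at run time rather than coded into the system.
MAPPING = {"Name": "Identifier", "Documentation": "BodyText"}

XML_TEXT = (
    "<Class><Name>Car</Name>"
    "<Documentation>A road vehicle</Documentation>"
    "<Other>ignored</Other></Class>"
)

FEATURES = build_features(XML_TEXT, MAPPING)
```

Supporting a document with a different structure then amounts to supplying a different MAPPING.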

Although FIRE has been prototyped before, no full implementation has ever been developed and applied to large-scale document collections. As it is of interest to the authors to explore the application of AUTOSOFT’s IR subsystem to web-based collections, which are potentially very large, demonstrating FIRE’s scalability is a vital factor in appraising the retrieval engine’s usefulness. Specifically, a later paper will investigate whether the switch to an RDBMS and the simplification of the class structure in the model increase efficiency and speed sufficiently to allow the system to work with large collections such as those used in the Text Retrieval Conference (TREC).

Information retrieval provides a classic metric for judging the effectiveness of a system through the measurement of precision and recall. These standard benchmarking measurements will be taken and compared with other systems. As previously noted, this process is aided by FIRE’s flexibility and modular design which provides for interchanging different indexing and retrieval methods to achieve optimal performance. However, although precision and recall are important benchmarks, reliance on them has been recently questioned [45] and evaluation should therefore also look at the value that is generated by an information retrieval system rather than focusing purely on how close it gets to delivering an ideal in terms of precision and recall. The DoD [6] surveyed 9 major commercial organisations, ranging from AT&T to Texas Instruments, and 6 large Government agencies and identified a number of key metrics which could be applied to software reuse:

Return on investment

Reduced cost

Cycle time reductions

Ability to deliver on or ahead of schedule

Reduced errors and risk

Ease of maintenance

Reduction in lines of code supported

Increased confidence amongst managers and customers

Reduction of stress amongst employees

Lead over competitors

These findings were supported by a survey of 9 UK commercial organisations undertaken by the project team [46]. Feedback from the commercial partners as to the effectiveness of the prototype in terms of the above measures of business value will therefore be at least as important as the figures for precision and recall.
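For reference, the precision and recall measures mentioned above can be computed for a single query as in the following Python sketch. The function name and the sample document identifiers are illustrative assumptions; the definitions are the standard ones (the proportion of retrieved documents that are relevant, and the proportion of relevant documents that are retrieved).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
PRECISION, RECALL = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

In this example half of the retrieved documents are relevant and two of the three relevant documents are found.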

Other avenues for future work include the development of a web-browser based interface to the retrieval engine and expansion of the functionality to incorporate different XML DTDs. As noted above the design of the system will allow most XML DTDs to be incorporated without any programmatic changes to the system. Currently it is envisaged that the configuration of the system can be set using an XML file that describes the mapping between XML elements, document features and indexing features.
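Since the configuration format has not yet been fixed, the following Python sketch should be read only as one possible shape for such a file. The element and attribute names (fire-config, map, element, feature, index-feature, index) and the function load_config are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical configuration: one <map> per XML element to be indexed.
CONFIG_TEXT = """
<fire-config>
  <map element="Name" feature="Identifier"
       index-feature="uppercase" index="NameIndex"/>
  <map element="Documentation" feature="BodyText"
       index-feature="lowercase" index="TextIndex"/>
</fire-config>
"""


def load_config(config_text):
    """Return {XML element name: mapping details} from the configuration."""
    root = ET.fromstring(config_text)
    config = {}
    for mapping in root.findall("map"):
        config[mapping.get("element")] = {
            "feature": mapping.get("feature"),
            "index_feature": mapping.get("index-feature"),
            "index": mapping.get("index"),
        }
    return config


CONFIG = load_config(CONFIG_TEXT)
```

A new DTD would then be supported by writing a new configuration file rather than by changing code.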

REFERENCES

[1] Williamson, M. Software reuse. CIO Magazine 1 March (1997). Available at: http://www.cio.com/archive/030197_technology.html
[2] Jacobson, I., Griss, M. and Jonsson, P. Software reuse: Architecture, process and organization for business success. (New York: ACM Press, 1997).
[3] Software reuse: Current practice and potential. SEM 1021. (Wokingham: ERS, 1996).
[4] Griss, M.L. Software re-use: From library to factory. IBM Systems Journal 32(4) (1993) 548-566.
[5] Davenport, T.H. Information ecology: Mastering the information and knowledge environment. (Oxford: Oxford University Press, 1997).
[6] Dikel, D.M. et al. Software reuse study, 1996. (Applied Expertise, 1996). Available at: http://dii-sw.ncr.disa.mil/reuseic/lessons/benchmark/html.bench.htm
[7] Available at http://dis.sema.es/projects/SER/sermain.html
[8] Saiedian, H. and Zand, M. A framework for evaluating software environments that support design reuse. Journal of Computing and Information Technology 5(4)(1997) 249-264.
[9] Morel, J.M. Experiences of reuse with the REBOOT method. Genie Logiciel, 42 (1996) 45-50.
[10] Sindre, G., Conradi, R. and Karlsson, E-A. The REBOOT approach to software reuse. Journal of Software and Systems 30(3) (1995) 201-212.
[11] Karlsson, E-A. Software reuse: A holistic approach. (New York: John Wiley, 1995).
[12] Kovacs, G.L., Kopacsi, S., Nacsa, J., Haidegger, G. and Groumpos, P. Application of software reuse and object-oriented methodologies for the modelling and control of manufacturing systems. Computers in Industry, 39(3) (1999) 177-189.
[13] Available at http://www-cs.open.ac.uk/euroware/euroware.html
[14] Software Technology for Adaptable Reliable Systems (STARS). Reuse Strategy Model: Planning Aid for Reuse-based Projects. Boeing STARS Technical Report D613-55159. (Arlington: STARS Technology Center, 1993).
[15] Klingler, C.D. DAGAR: A Process for Domain Architecture Definition and Asset Implementation. In: Proceedings of ACM TriAda 96. (New York: ACM, 1996).
[16] Macala, R.R., Stuckey, L.D. and Gross, D.C. Managing domain specific product line development. IEEE Software May (1996) 57-67.
[17] Xiaoqun, C. and Weizhong, S. Supporting Project-Centered Reuse in Object-Oriented Software Development. In: Technology of Object-Oriented Languages and Systems - Tools 24, September, 1997, Beijing, China. (New York: IEEE, 1998).
[18] Diaz, I., Velasco, M., Llorens, J. and Martinez, V. Semi-automatic construction of a thesaurus applying domain analysis techniques. International Forum on Information and Documentation, 23(2) (1998) 11-19.
[19] Velasco, M., Diaz, L., Llorens, J., de Amescua, A. and Martinez, V. Statistical filtering techniques applied to obtaining hierarchical relationships in the automatic construction of a thesaurus. Revista Espanola de Documentacion Cientifica, 22(1) (1999) 34-49.
[20] Davenport, T.H. Process innovation: Reengineering work through information technology. (Boston: Harvard Business School Press, 1993).
[21] Ould, M.A. Business processes: Modelling and analysis for re-engineering and improvement. (Chichester: John Wiley, 1995).
[22] Bocij, P. et al., Business information systems: Technology, development and management. (London: Financial Times Management, 1999).
[23] Llorens, J. A framework for client-server reuse: Reusable Artefacts Methodology (RAM). (Madrid: The AUTOSOFT Consortium, 1999).
[24] Definition of system architecture. (Madrid: The Autosoft Consortium, 1999).
[25] Information Technology - Text and Office Systems - Standard Generalized Markup Language (SGML). ISO 8879-1986. (Geneva: ISO, 1986).
[26] Extensible Markup Language (XML). (World Wide Web Consortium, 1998). Available at: http://www.w3c.org/TR.1998/RECxml-19980210
[27] Fuhr, N. Toward data abstraction in networked information systems. Information Processing and Management 5(2) (1999) 101-119.
[28] Lundquist, C., Friedler, O., Holmes, D.O. and Grossman, D. A parallel relational database management system approach to relevance feedback in information retrieval. Journal of the American Society for Information Science 50(5) (1999) 413-426.
[29] Grossman, D., Holmes, D.O., Friedler, O., and Roberts, D. Integrating structured data and text: a relational approach. Journal of the American Society for Information Science 48(2) (1997) 96-121.
[30] Wilkinson, R. Effective retrieval of structured documents. In: SIGIR’94: Proceedings of 17th ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin City, 1994. (London: Springer-Verlag, 1994) 311-317.
[31] Bohm, K. Building a configurable database application for structured documents. Technical Report No. 942. (Darmstadt: GMD,1995).
[32] Kaszkiel, M. and Zobel, J. Passage retrieval revisited. In: SIGIR’97: Proceedings of 20th ACM-SIGIR Conference on Research and Development in Information Retrieval Philadelphia, 1997. (New York: ACM, 1997).
[33] Guan, T. and Wong, K.F. KPS: A web information mining algorithm. Computer Networks 31(11) (1999) 1495-1507.
[34] Rath, H.H. XML; chance and challenge for online information providers. In: Online Information 98. Proceedings of 22nd International Online Information Meeting, London, 8-10th December 1998. (London: Learned Information, 1998) 339-345.
[35] XML Metadata Interchange (XMI). Proposal to the OMG OA & DTF RFP 3: Stream-based Model Interchange Format (SMIF). OMG Document ad/98-10-05. (Framingham: OMG, 1998).
[36] Eriksson, H-H. and Penker, M. UML toolkit. (New York: John Wiley, 1997).
[37] Booch, G. Object-oriented analysis and design with applications. (Redwood City: Benjamin Cummings, 1994).
[38] Jacobson, I., et al. Object-oriented software engineering. (Reading: Addison-Wesley, 1992).
[39] Rumbaugh, J., et al. Object-oriented modeling and design. (Englewood Cliffs: Prentice-Hall, 1991).
[40] Sonnenberger, G. and Frei, H. Design of a reusable IR framework. In: SIGIR’95: Proceedings of 18th ACM-SIGIR Conference on Research and Development in Information Retrieval, Seattle, 1995. (New York: ACM, 1995) 49-57.
[41] Frei, H.P. Information retrieval - from academic research to practical applications. In: Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, April 1996.
[42] Sonnenberger, G. Exploiting the functionality of object-oriented database management systems for information retrieval. IEEE Data Engineering Bulletin 19(1) (1996) 14-23.
[43] Available at http://www.bradandkathy.com/genealogy/overviewofsoundex.html
[44] Bratvold, T. Union Bank of Switzerland, IT Laboratory, PO Box, CH-8021 Zurich, Switzerland. (Personal communication).
[45] Available at: http://www.dcs.gla.ac.uk/mira/themes1.html
[46] McCartan, C., Sweeney, N., Gibb, F. and O'Donnell, R. Review of industry perspective of software re-use. Glasgow: AUTOSOFT Consortium, 2000.

Document Category	Description
Analysis documents	Produced during the analysis phase of the development process, these will typically be specifications and requirements documents but may also include documents with embedded domain knowledge and analysis. This category of document is composed of textual documents and potentially documents drawn up in UML
Design documents	Typically, documents concerning and describing the proposed or actual technical function of software. Again, both textual and UML documents may be present in this category.
Source code	This category represents all software code generated during the development process and, potentially, associated test cases. This includes textual source code with its embedded comments and interface descriptions for components.