In this thesis report, a framework for predictive modeling in distributed and collaborative data mining environments is detailed. The framework uses meta-model and meta-cluster approaches and is a suitable test-bed for experiments on ensemble learning, partitioned learning and meta-learning. Two components of the framework, namely Model Interchange Format (MIF) formalism for exchanging predictive models and Resource Information Format (RIF) formalism for management of resources are described. The performance analysis of a system implementing this framework is presented as well.
This framework has been developed to meet three pressing needs: (1) to support distributing a data mining task among different data sites, a scenario that arises because of the inherent distribution of data in applications or because the current generation of learning algorithms cannot handle very large data-sets; (2) to allow for collaborative data mining, i.e., the inter-operability of data mining systems; and (3) to support benchmarking of models and techniques.
The model interchange format is the meta-model used to describe predictive models; it facilitates the migration and easy ingestion of these models at geographically remote sites. The meta-cluster approach tackles the problem of distributed and large data-sets by providing an infrastructure for the management and transparent communication of resources between sites.
The guiding principles and the specifications of an SGML/XML based model interchange format and resource information format are presented. These specifications are the result of meetings between the representatives of member organizations of the Data Mining Group (DMG).
Papyrus, a distributed and scalable data mining system that employs general meta-model and meta-cluster approaches, has been implemented as a proof of concept of the framework. Papyrus provides a set of learning programs, implemented as C++ applications, which can compute predictive models over data stored locally at several distributed clusters/sites. Papyrus also provides a set of software agents and directory services for combining multiple models that were learned (perhaps) at different sites.
The overall architecture of the Papyrus system and the specific implementation currently under development at the Laboratory for Advanced Computing (LAC), University of Illinois at Chicago (UIC), are presented. The experimental setup and results of performance tests for speed and accuracy analysis using Papyrus for inherently distributed data are presented.
One of Papyrus's target applications uses healthcare data obtained from the University Health Consortium (UHC). This application was one of LAC's demonstrations at the Internet 2 conference at Washington DC (Oct. 7-9, 1997) and also at Supercomputing 97 (Nov. 16-21, 1997) in San Jose, CA. Interested readers can learn more about this project at http://www.lac.uic.edu/~balinder/medical/sc97/top.html.
I would like to thank my thesis advisor, Dr. Robert Grossman, for introducing me to data mining and for all his guidance and support. This thesis work was enabled and sustained by his vision and ideas.
Stuart Bailey is due a special note of thanks for his unique way of inspiring students through clarity of thought, enthusiasm and caring. He is a model to all laboratory students. His technical excellence, unwavering faith and constant encouragement were very helpful and made this effort an enjoyable one.
I wish to thank Dr. Donald Mon for providing us with the healthcare data and introducing us to issues in healthcare management.
I wish to thank Dr. Simon Kasif for his guidance and comments on the experiments.
I am grateful to Shirley Connelly and David Turkington for proofreading this report. Their comments and suggestions were very valuable as the thesis progressed.
A number of individuals in the Laboratory for Advanced Computing were very helpful and understanding. I would like to acknowledge Ashok Ramu, Marco Mazzucco and Srinath Gutti in particular for their contribution to this work.
1. Introduction
1.1 Predictive Modeling Culture in Data Mining
1.2 Current Challenges in Predictive Modeling
1.3 A historical and philosophical perspective
1.4 Scope of this work
2. PMML: A Meta-Model Formalism for Predictive Models
2.1 What is a Meta-Model Approach?
2.2 Approaches to Model Specification
2.3 Related Modeling efforts
2.4 Design Goals for Model Interchange Format
2.5 SGML/XML based formalism for Model Interchange Format
2.6 Advantages of using SGML/XML based formalism
2.7 File format for Single Predictive Models
2.8 File format for Ensemble of models
3. Resource Information Management in heterogeneous, distributed data mining environments
3.1 Characteristics of the Underlying Architecture
3.2 Unified Global Resource Management Model
3.3 Role of Meta-data in Emerging Systems
3.4 Logical Components of Metadata
3.5 Design Goals for Resource Information Format
4. Papyrus: Philosophy, Design and Architecture
4.1 Meta-Cluster Approach
4.2 Papyrus Architecture
4.3 Major Components of Papyrus
4.4 Advantages of Papyrus Architecture
5. Performance Analysis of the Multiple-Model Learning Approach for Inherently Distributed Data-Sets using Papyrus
5.1 Inherently distributed data
5.2 Experiments and Results
5.2.1 Goals
5.2.2 Setup
5.2.3 Time and Accuracy Results
5.3 Initial Observations
5.3.1 Time Complexity
5.3.2 Prediction Error Rates
5.4 Initial Conclusions
6. Conclusions and Future Work
APPENDIX A: File Format for Single Predictive Models
APPENDIX B: File Format for Ensemble of Models
APPENDIX C: File Format for Resource description
CITED LITERATURE
1. Introduction
This chapter introduces the predictive modeling culture in data mining and addresses the research challenges of predictive modeling in distributed and collaborative data mining environments. It also provides a brief historical and philosophical perspective on the evolution of these challenges and concludes with the scope and motivation of this work.
1.1 Predictive Modeling Culture in Data Mining
Formally, data mining can be defined as the process of automatic extraction of patterns, associations, changes and anomalies in large data sets (Grossman, 1997a, unpublished data). This process involves some search, which may be done by the user or may be assisted by a smart program that automatically searches the database for significant patterns. The issues raised in some data mining problems over large data-sets can be equally relevant for small data-sets (Grossman, 1997a, unpublished data).
Traditionally there have been two types of statistical analyses: Confirmatory Analysis and Exploratory Analysis (Parsaye, 1997a). In confirmatory analysis, also called the verification model, the user has a hypothesis and either confirms or refutes it. In exploratory analysis, also known as the discovery model and pioneered by John Tukey (Tukey, 1973), one finds suitable hypotheses to confirm or refute. Here the system, not the user, takes the initiative in data analysis. By our definition of the problem, data mining fits the mold of exploratory analysis: the system is expected to initiate the questions by itself.
From the process-oriented view, there are three classes of data mining activities: discovery, predictive modeling and forensic analysis, as shown in Figure 1. The figure is adapted from (Parsaye, 1997b).
Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about what the patterns may be (Parsaye, 1997b). This deals with the application of various types of approaches and algorithms to find these patterns or models in the data. These include logic methods (rules, decision trees, etc.) and equational methods (statistics, neural nets, etc.).
In Predictive Modeling, models discovered from the data-set are used to predict the future. Predictive Modeling thus allows the user to submit records with some unknown field values, and the system will guess the unknown values based on the previous models discovered from the data-set. A predictive model could be a classifier like a decision tree or linear regression.
Forensic Analysis is the process of applying the extracted models to find anomalous or unusual data elements. The focus of this thesis is on predictive modeling.
1.2 Current Challenges in Predictive Modeling
One means of acquiring new knowledge/models from databases is to apply various machine learning algorithms that compute descriptive representations of the data as well as patterns that may be exhibited in the data. Chen, et al. (1996) have presented an excellent overview of the range of machine learning algorithms and their applicability. There is also excellent research literature on the scalability of algorithms over large data-sets (Parsaye, 1996; Provost, 1997; Chan and Stolfo, 1995). However, new challenges arise when the data is distributed and when there is a need for the inter-operability of systems.
Two situations call for distributed data mining, i.e., building predictive models over local data-sets at remote sites and then combining the results at a central site using model selection or averaging.
Sometimes, data is inherently distributed. An example is the Health Care database of UHC (University Health Consortium), which is a consortium of 60 university hospitals. It is possible to transport the data to one central site and build one predictive model over this data-set. This model may provide the best possible accuracy, limited only by the algorithm used. But when data-sets are very large, the cost of transporting data to a single site is usually high and, at times, unacceptable.
At other times, data is artificially distributed. This can be because many existing learning algorithms require all the data to be resident in memory, which is clearly untenable for many real-world data-sets. Provost (1997) has suggested data partitioning as one solution for scaling up such algorithms. Also, Domingo (1997) and Drucker (1997) have shown that the error rate of classifiers like trees can be reduced by learning multiple models and combining the results. This multiple-model approach is referred to as ensemble learning, partitioned learning, distributed learning or meta-learning, depending on the context (Grossman, 1998a).
Hence, the underlying approach in these distributed data mining environments is to build predictive models locally and then transport the models to a central site. This approach is usually expected to be more efficient in terms of run-time and network traffic costs. Since models are built in parallel across remote sites, run-time costs are reduced. Also, since serial code is executing at each site, it is not necessary to write parallel code that needs to coordinate processes across the network, though parallelism at local sites can still be exploited. Network costs are less because the models are expected to be smaller than the data-sets. However, the challenge of combining the models using selection or averaging for maximum prediction accuracy is a fairly serious one.
The challenges here are: 1) how to inter-operate the vast number of data mining systems in use today, and 2) how to perform benchmarking tests.
"The proprietary nature of the available data mining system is inhibiting the growth of the data mining industry as a whole." (Grossman, 1997b)
The above observation indicates the need for collaborative modeling in data mining. Realistic data mining systems are becoming much too complex for any single group of researchers to implement single-handedly, requiring collaboration among people from different specializations. There is a need for the exchange and incorporation of predictive models between data mining systems, possibly running on different hardware/software platforms.
Communicating the structure of the model to others can become an insurmountable obstacle to collaboration and acceptance of the predictive model.
These two requirements necessitate a distributed data mining framework that provides a mechanism for managing and transporting predictive models between geographically remote sites. There is also a need for a universally accepted common description format for these predictive models that allows for the easy exchange and incorporation of models between systems, perhaps from different vendors.
This description format can also form the basis for benchmarking tools and standards, and may be important for the commercial acceptance of data mining products. According to David Clark's "apocalypse of the two elephants," shown in Figure 2 (Tanenbaum, 1996), it is essential that reasonable standards be formed before the onrush of commercial investment.
1.3 A historical and philosophical perspective
To understand the current challenges better, it is both interesting and useful to trace the evolution of data mining systems.
Computers can now store all forms of information: records, documents, images, sound recordings, videos, scientific data, and many other data formats. Computer scientists have made great strides in capturing, storing, managing and visualizing this data. These tasks are generally called data management.
There have been several distinct phases in data management (Gray R, 1996). Initially, data was processed manually, with the first known records dating back to 4000 BC in Sumeria for keeping track of royal assets and taxes. The first practical programmatic information processing began in the early twentieth century with the introduction of punched cards and electro-mechanical computers to sort and tabulate millions of records. The third phase stored data on magnetic tape and used stored-program computers to perform batch processing on sequential files. The fourth phase automated the concept of a database schema and online navigational access to the data (Gray R, 1996). The fifth phase automated access to relational databases and added distributed and client-server processing. We are now in the sixth generation of systems, which not only store much richer kinds of data (such as images, voice and video) but also provide specialized features like data warehouses and data marts to enable fast access to relevant data.
A Data Warehouse (Grossman, 1997a and Mon, 1997) is a data management system structured to facilitate the analysis of data; it is where all the corporate data is stored in one place using normalized data structures. A Data Mart (Mon, 1997) is where departmental data is stored, often with external data items added. Both deal mostly with the data space.
While advances in data management technologies have allowed us to store huge amounts of data, a parallel problem is that of analyzing these large data-sets. Moreover, data is at times inherently distributed and cannot be localized on any one machine. Distribution adds one more dimension to the data analysis problem.
The Earth Observation System / Data Information System (EOS/DIS) (Gray R, 1996) being built by NASA is expected to accumulate 5 terabytes of remote sensor data each day. There are other scientific, government and commercial databases that are growing at a rapid pace. An example of a distributed database is the TIMSS project student database (TIMSS, 1996).
Current database search techniques, like SQL reporting or spreadsheets, do not scale well to these large and distributed data sets. Users expect not only automated management of these data-sets but also "simple and powerful automated tools" to browse, search, analyze and visualize the data.
"These requirements are stretching the limits of what we know how to do today." (Parsaye, 1996)
The above problem is not just a database problem. It has attracted people from backgrounds as diverse as machine learning, statistics, computational science, control theory and many more, to apply their skills and techniques toward the unified goal of developing automated tools and techniques for analyzing huge data-sets. After the expected debates on techniques from the various fields, there is a convergence of ideas on the nature of the problem and the applicability of solutions. A new field known as Data Mining has emerged from these debates.
1.4 Scope of this work
This thesis work is aimed at proposing a framework for predictive modeling in distributed and collaborative environments. The focus is to attack the problem from the application design viewpoint.
The following forces motivate this work. We believe that research into application-level design has been lacking in the field of data mining. A great deal of attention within the data mining community has focused on the discovery part, i.e., the learning algorithms and related issues. While this is very important and serious work, it is important that efforts be continued with equal vigor at the application design end of the data mining spectrum.
Meta-Model and Meta-Cluster approaches are proposed to handle mining of distributed data within the same system, or to allow for inter-operability of different systems.
The meta-model approach deals with predictive models. The emphasis is not on the process of model formation but on the description of predictive models. This approach uses a declarative formalism (based on SGML/XML) for describing predictive models. This declarative Model Interchange Format (MIF) is described using a language known as the Predictive Model Markup Language (PMML) (Grossman et al., 1998 and PMML, 1998).
A meta-cluster approach is being proposed as a possible solution to the problem of distributed data mining architectures over the wide area. This seeks to compute a number of independent classifiers by applying learning programs to a collection of independent and geographically distributed databases in parallel. The results of the independent classifiers (i.e., derived models) are then combined using the PMML.
A distributed resource management infrastructure that uses Tcl based software agents (Gray et al., 1996) and SGML based meta-data representation is presented. The core component of the infrastructure, i.e., the Resource Information Format (RIF), is discussed in detail.
These approaches are demonstrated using a prototype implementation, Papyrus. The performance results of Papyrus on distributed data for multiple-model learning are detailed. We have used healthcare data obtained from the University Health Consortium (UHC) and the C4.5 learning algorithm (Quinlan, 1993) for the experiments.
The rest of the work is presented as follows:
Chapter 2 discusses the idea of meta-models based on Markup Languages like SGML/XML and also describes the prototype MIFs developed using PMML.
Chapter 3 describes the proposed resource information management infrastructure for distributed data mining architectures and details its core element, RIF.
Chapter 4 presents the architecture and design of the prototype system, Papyrus, along with the implementation details.
Chapter 5 details the experimental setup and the results of performance tests on the UHC patient database using the Papyrus architecture.
The report concludes with a discussion of our experiences with the current work and directions for future work in Chapter 6.
2. PMML: A Meta-Model Formalism for Predictive Models
In the previous chapter, a case was made for the necessity of exchanging predictive models in distributed and collaborative environments. In this chapter, a meta-model approach for the description and, hence, exchange of models is proposed. The design guidelines and specifications of a model interchange format based on this meta-model are presented. This MIF uses PMML as the description language. In the next chapter, other controls needed for exchange of models in a distributed framework are discussed.
2.1 What is a Meta-Model Approach?
Our experience with data mining systems has indicated that users want semantically expressive models for their particular application domain, e.g., decision trees for medical data or Naïve Bayes algorithms for text mining. On the other hand, vendors need to provide general purpose tools. Hence from the application design viewpoint there is a need to concentrate on describing meta-models from which different predictive models can be built.
If predictive models are representations of some learning concept, then meta-models are descriptions of predictive models. A meta-model uses a language to define all well-formed models that may be represented. Similarly, in order to flexibly describe a meta-model, a meta-metamodel is required. A meta-metamodel defines an abstract language for specifying meta-models; a meta-model is an instance of a meta-metamodel and is itself a language for specifying models. This approach was first detailed by David Pullar and Kristin Stock (Pullar and Stock, 1997) for geospatial modeling. The meta-model in our case is the Model Interchange Format (MIF), and the language it uses to define well-formed predictive models is the Predictive Model Markup Language (PMML). The meta-metamodel uses the abstract languages SGML/XML for specifying instances of meta-models (e.g., PMML, HTML, etc.).
The meaning of these abstract concepts is depicted in figure 3 using an analogy to the CORBA (CORBA, 1995) and HTML systems. This figure is adapted from Crawley, et al. (1997).
If a meta-metamodel describes a type system for CORBA IDL, then the meta-model would be certain schemas of CORBA IDL types, and the entities in the model would be CORBA objects. Similarly, a meta-metamodel would be an SGML/XML system for both text documents and predictive models. For the text documents, the meta-model, i.e., Text Interchange Format (TIF) uses HTML as the defining language. On the other hand, our system for predictive models uses PMML as the language for describing the Model Interchange Format meta-model.
At an abstract level any meta-model consists of two parts: 1) a model specification and 2) a model executable. The model specification is the description of the model that is used to generate the model instance or executable. Even though we are using an SGML/XML based declarative formalism for predictive model specification, in general three approaches are possible. Section 2.2 details these approaches and section 2.3 discusses the related modeling efforts. In section 2.4, the design goals of our approach are explained.
This thesis deals with the specification requirements of the predictive models, leaving the model executable duties or choices to the various people exchanging the models.
2.2 Approaches to Model Specification
There have been few efforts to date in the data mining community on modeling predictive models.
In the summer of 1996, the program co-chairs of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96) in Portland arranged a general meeting on the Bayesian Network Interchange Format through the efforts of the Decision Theory Group at Microsoft Research and a team at the University of Aalborg in Denmark. The format that came out of these efforts uses a BN mini-language (BNIF, 1997) that has borrowed many features from the C/C++ language and includes a parser written in YACC. This formalism is a kind of procedural formalism.
2.3 Related Modeling efforts
There is a lot to be learned from modeling efforts in some other areas as well. Scientists at the Marshall Space Flight Center (Jaap and Davis, 1994) at NASA have developed a model interchange format for scheduling models among the community of space activity schedulers. This format is declarative in nature but proprietary to NASA. Also, PowerSim Corporation has developed declarative model interchange formats for their System Dynamics simulator (PowerSim, 1997).
2.4 Design Goals for Model Interchange Format
Four significant requirements drove the formulation of the model interchange format for predictive models. These were that any MIF must be universal, portable, human readable and extensible. Since the format was developed chiefly for the interchange of models, rather than for the storage and manipulation of models within data mining engines, these requirements were deemed to be more important than efficiency.
2.5 SGML/XML based formalism for Model Interchange Format
In view of the above design goals, a declarative formalism based on SGML/XML was chosen for predictive model specifications. The Predictive Model Markup Language (PMML) is used to define the structure and intent of the predictive model. PMML is a simple markup language that uses SGML/XML as its meta-language, in a manner similar to the Hypertext Markup Language (HTML) used for World Wide Web documents. SGML is International Standard ISO 8879 -- Standard Generalized Markup Language (SGML, 1986), and XML is the Extensible Markup Language (W3 XML, 1997).
"The predictive models described using PMML consist of several parts: 1) a header, 2) a data schema, 3) a data mining schema, 4) a predictive model schema, 5) definitions for predictive models, 6) definitions for ensembles of models, 7) rules for selecting and combining models and ensembles of models and 8) rules for exception handling. Component 5) is required. In addition a schema for the predictive model must be defined. This can be done using one or more of the schemas – components 3, 4 and 5. The other components are optional."(Grossman, 1998a).
PMML 0.8 uses SGML as its meta-language and a C++ based parser library (SP) to provide facilities or routines for generating and validating MIF documents. PMML 0.9 uses XML and a Java based parser library.
As an SGML application, the syntax of conforming PMML 0.8 documents is defined by the combination of the SGML declaration and the document type definition (DTD). This specification defines the intended interpretation of PMML 0.8 elements, and places further constraints on the permitted syntax which are otherwise inexpressible in the DTD.
A PMML document is stored in a file with a .pmml extension. Also, the document type definition for PMML is specified using SGML/XML and stored in a file with a .dtd extension.
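As an illustration of how a MIF consumer might ingest such a document, the sketch below walks the elements of an XML based PMML 0.9 file and prints them with their attributes. It is only a sketch: the thesis implementation uses the SP library for SGML and a Java parser library for XML, whereas this example assumes the freely available expat C library and a hypothetical file name.

// Illustrative sketch only, not the parser library used by the thesis.
// Reads a hypothetical file "model.pmml" (PMML 0.9, XML) and prints the
// element tree, which is the first step a scoring engine would take.
#include <expat.h>
#include <cstdio>

static void startElement(void *userData, const char *name, const char **atts)
{
    int *depth = static_cast<int *>(userData);
    for (int i = 0; i < *depth; ++i)
        std::printf("  ");
    std::printf("<%s>", name);
    // Attributes arrive as a NULL-terminated list of name/value pairs.
    for (int i = 0; atts[i]; i += 2)
        std::printf(" %s='%s'", atts[i], atts[i + 1]);
    std::printf("\n");
    ++*depth;
}

static void endElement(void *userData, const char *name)
{
    (void)name;
    int *depth = static_cast<int *>(userData);
    --*depth;
}

int main()
{
    std::FILE *fp = std::fopen("model.pmml", "r");  // hypothetical MIF file
    if (!fp)
        return 1;

    int depth = 0;
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetUserData(parser, &depth);
    XML_SetElementHandler(parser, startElement, endElement);

    char buf[8192];
    for (;;) {
        size_t len = std::fread(buf, 1, sizeof(buf), fp);
        int done = std::feof(fp) ? 1 : 0;
        if (XML_Parse(parser, buf, static_cast<int>(len), done) == XML_STATUS_ERROR) {
            std::fprintf(stderr, "parse error at line %lu\n",
                         static_cast<unsigned long>(XML_GetCurrentLineNumber(parser)));
            break;
        }
        if (done)
            break;
    }
    XML_ParserFree(parser);
    std::fclose(fp);
    return 0;
}

Validation against the PMML DTD is left to a validating parser; the point here is simply that any generic SGML/XML toolkit can read a MIF document without knowledge of the producing data mining engine.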
2.6 Advantages of using SGML/XML based formalism
The advantages of using an SGML/XML based formalism are many.
A related consideration is the trend in practical applications: in fields like healthcare, SGML based formats are fast becoming the standard.
This text based approach may not be as efficient as other methods, but as discussed in section 2.4, in our view that is not a primary requirement here.
The current implementation of PMML 0.8 supports predictive models like decision trees (C4.5, ID3 and CART (Breiman et al., 1984)) and linear regression. The modular design allows for the later addition of specifications for other models once the experts in those fields agree on a format.
2.7 File format for Single Predictive Models
The structure of a PMML document for a single predictive model is described in Appendix A. The format provides mechanisms for describing the data schema as well as the predictive models (like C4.5, CART, ID3 and linear regression).
2.8 File format for Ensemble of models
"Ensembles of models are important for several reasons. 1) Ensembles of models often yields more accurate predictors (Brieman, 1996 and Wolpert, 1992). 2) Large data sets may be partitioned, the different partitions may be mined in parallel to produce an ensemble of models, and then the results combined using model averaging (Grossman, 1996). 3) Distributed data may initially be mined separately to produce an ensemble of models and then a single predictive model produced from this ensemble. Sometimes this is called meta-learning (Chan and Stolfo, 1995)." (Grossman, 1998a). The structure of PMML documents for ensembles of models is described in Appendix B. The key addition for ensembles of models is the introduction of the model-weight element.
3. Resource Information Management in heterogeneous, distributed data mining environments
In the previous chapter, a formalism for the description of predictive models was detailed. Many more control mechanisms are needed for this formalism to work in heterogeneous and distributed data mining environments. This chapter presents an experimental framework for resource management in these environments and details its core structure, the RIF (Resource Information Format) model for unified metadata representation and management. This resource management infrastructure is designed around the needs of an underlying architecture whose characteristics are presented first.
3.1 Characteristics of the Underlying Architecture
The following features of the architecture need to be considered in designing the resource management framework.
To accommodate a large number of cooperating sites, the sites are hierarchically grouped into clusters, with each cluster having a well-defined point of entry. Since a Wide Area Network might link the clusters, each cluster might have different hardware and storage capabilities.
Each cluster must have complete control over its own resources. That includes what data and knowledge resources to keep and where to keep them. Each site can choose its data representation (e.g., relational, object-oriented, etc.) and appropriate data discovery tools (decision trees, neural networks, etc.).
Global queries are dynamically formed and allocated to clusters based on the latest resource information available. Thus, the resource information gathering is event-triggered.
Since moving large amounts of data around is a heavy-weight operation, the query is moved to the data. Also, the data allocation is static or infrequently changing for such architectures.
3.2 Unified Global Resource Management Model
A global information resources framework is developed to meet the requirements of the above architecture. It is global because the names in this framework are global. Two basic elements are necessary for the resource management environment:
The logical model for meta-data representation is a core element of this infrastructure and is physically embodied in the proposed Resource Information Format described below.
The management system includes the cluster-point-of-entry infrastructure for cluster management and the Tcl software agent system (Ashok, 1998) to communicate between the clusters.
The emphasis in this chapter is on meta-data design and format. Refer to chapter 4 for details on the management system.
3.3 Role of Meta-data in Emerging Systems
Meta-data is a key element in any resource management system. Representative (previous) meta-data systems include traditional data dictionary systems, the Information Resources Dictionary System (IRDS) of the National Institute of Standards and Technology (ANSI X3H4, 1985), the metadatabase system (Hsu, et al., 1991) and, more recently, the Meta Data Council's Interchange Specification Initiative (MDC, 1997). Most of these are limited to modeling only data resources. The scope of the meta-data needs to be extended for emerging data mining systems because of the following two considerations.
In data mining systems, the meta-data must represent both data systems and knowledge resources. Thus, the contents of the meta-data include not only information about the data resources (like the data schemas, etc.) but also information like the knowledge models (e.g., predictive models), data discovery tools and data extraction tools.
To realize distributed and collaborative data mining environments, the meta-data should model not just the subsystems, i.e., individual sites, but also the information that formulates the interactions between sites that may be heterogeneous and located over a Wide Area Network (WAN). WANs introduce heterogeneity, like disparate data views, that may not exist for LANs.
Thus, the scope of meta-data encompasses both data resources and knowledge, and requires new methods and techniques to manage these information resources in a unified way for diverse user groups. The logical components of meta-data in distributed data mining systems are discussed next.
3.4 Logical Components of Metadata
Several requirements for an ideal meta-data representation method can be derived from the above discussion.
While the data views and decision knowledge are essential for intra-cluster autonomy and functioning, operating knowledge is the glue that acts as an active facilitator for inter-cluster information integration and allows users to tap resources over wide area clusters for global query formation and processing.
We have chosen a relation-like approach to accommodate heterogeneous data views. Thus a dataset element is viewed as consisting of several table elements and associated predictive model elements. A table element is described by a set of attribute-descriptor elements that contain information, like use-as and data-type, required by data discovery algorithms. A model element has information like the model name and model type.
Dataset elements are encompassed by domain elements that help to generalize the framework over different domains. The domain elements, in turn, are encompassed by cluster elements. Cluster elements represent the cluster-point-of-entry and have information on the data discovery tools available at the cluster in the form of data mining engine elements.
These meta-data elements in our framework are described by what we call the Resource Information Format (RIF). The design guidelines and specifications of this format are presented next.
3.5 Design Goals for Resource Information Format
The design goals for the resource interchange format are similar to the design goals for MIF, i.e., universality, portability, extensibility and human-readability.
Using reasoning similar to MIF, an SGML based declarative formalism was chosen to describe RIF for easy integration into the system. The language used to describe the format is also called RIF in the present implementation, though ideally it should be called RIML (Resource Interchange Markup Language). The specifications for the RIF 0.8 implementation are described in Appendix C.
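As an illustration, a minimal RIF 0.8 instance following the element hierarchy specified in Appendix C might look like the fragment below. The cluster, domain, dataset and table names, as well as the point-of-entry address, are hypothetical, and the SCHEMA and MODEL contents are elided.

<!DOCTYPE RIF PUBLIC "-//LAC//DTD RIF 0.8">
<RIF>
<CLUSTER NAME='lac-cluster' POINT-OF-ENTRY='cpoe.lac.example.edu'>
<DOMAIN NAME='Medical'>
<DATASET NAME='uhc-patient-data'>
<TABLE NAME='admissions'>
... SCHEMA element describing the table
</TABLE>
... MODEL elements for predictive models built on this dataset
</DATASET>
</DOMAIN>
</CLUSTER>
</RIF>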
4. Papyrus: Philosophy, Design and Architecture
In this chapter, an overview of the meta-cluster approach is presented, and Papyrus, a system that employs PMML for inter-operability, RIF for resource information management and the meta-cluster approach to address large scale distributed data mining applications, is described. Papyrus is a powerful and easily extensible software agent based system that computes classifiers over distributed data. Papyrus is being used in experiments dealing with real-world learning tasks such as predicting patient outcomes in medical data. We are also using Papyrus as a work-bench for experiments on multiple-model learning.
4.1 Meta-Cluster Approach
The meta-cluster technique provides a unifying and scalable solution that improves the efficiency and accuracy of inductive learning when applied to large amounts of data in wide area computing networks, for a range of different applications.
In this approach, data is distributed over a number of sites, with all the sites on a particular LAN being identified as a cluster. The clusters are connected to each other via a WAN. A number of learning processes (each implemented as a distinct serial program) are executed in parallel on these data sets or data subsets. The collective results are then combined through the meta-cluster. Stolfo, et al. (1997) and Kargupta, et al. (1997) have also investigated the meta-cluster approach.
4.2 Papyrus Architecture
Papyrus is architected as a distributed computing construct. Papyrus uses PMML for the predictive model description, RIF for the resource information description, Tcl agents (Gray J, 1996) for communications between clusters, and Ptool (Grossman, et al., 1994) as the data warehouse.
Papyrus is implemented as a collection of distributed learning and classification programs linked together through a network of clusters. Each cluster consists of the components described below.
First, the user accesses the Data Mining Point Of Presence (DMPOP) through a web browser. After proper authentication the DMPOP sends the agents out to all clusters accessible via CPOEs to get the information about the resources available. These resources can be data-sets, existing predictive models or various data mining services. Based on this information (represented using RIF) the user then interactively builds a data mining query. The query could be to build a model on some particular data-set, score using some existing models, or visualize an existing model. The agents then dispatch this query to the relevant clusters and initiate their execution. The data mining engine at each cluster executes the query on the local data sets and the agents then send the results (represented using MIF) back to the DMPOP, which then displays them to the user. The logical architecture of the system is presented in Figure 4.
Figure 4: The logical architecture of Papyrus system.
4.3 Major Components of Papyrus
Agent-Tcl (Gray J, 1996) was used as the transportable agent system, because Tcl (Ousterhout, 1994) is an embeddable scripting language that is highly portable, highly popular and freely available.
This implementation has various library routines and utilities to clean and transform the data sets appropriately for the various predictive model building algorithms.
4.4 Advantages of Papyrus Architecture
Papyrus was included in the demonstrations performed by the Laboratory for Advanced Computing (LAC) at the Internet 2 conference at Washington DC (Oct. 7-9, 1997) and at Supercomputing 97 conference in San Jose (Nov. 16-21, 1997). Interested readers can learn more about this project at http://www.lac.uic.edu/~balinder/medical/sc97/top.html.
5. Performance Analysis of the Multiple-Model Learning Approach for Inherently Distributed Data-Sets using Papyrus
This chapter describes our initial experiments using Papyrus for multiple-model learning on inherently distributed data over wide area networks. The time complexity and error rates of building C4.5 decision tree classifiers concurrently at multiple sites, transporting them to a central site and combining the scores using majority voting are tabulated. The results show that the time complexity of multiple-model learning grows more nearly linearly with the number of training examples than that of the single model approach. Also, in certain cases, multiple-model learning outperforms the single model approach in terms of prediction errors.
5.1 Inherently distributed data
The increase in the amount of readily available information has furthered the development of distributed information storage. In certain cases, data is inherently distributed and cannot be localized on any one machine for a variety of practical reasons. In other cases data is artificially distributed due to the limitation of certain algorithms to work on larger data sets.
There are basically two approaches to building a global classifier on such distributed data. The first is to move all the data partitions to one central site and then build a single predictive model on the entire data set. The second is to execute a number of learning processes on these data partitions and then bring the models back to a central site to integrate the results using model selection or model averaging. The former results in heavy bandwidth usage, while the latter raises issues regarding the accuracy of the classifier. Current approaches to multiple-model learning, like Bagging and Boosting, focus on minimizing the error percentage of decision tree classifiers. The data partitions used are not independent of each other, e.g., each data partition in bagging contains on average 63% of the original data set. Hence the focus so far has been mostly on error rate trends on artificially distributed data.
Our intention is to deal with inherently distributed data. Learning from an ensemble of models built on inherently distributed data raises several speed, accuracy and communication issues. The data partitions in such cases are disjoint, and hence the accuracy results of techniques like Bagging and Boosting cannot be directly related. Also, the time complexity of this multiple-model approach is expected to be significantly less than that of the single-model approach if the models are built concurrently. The nature of the networks involved, i.e., WANs, raises new communication issues. The network connection between geographically remote sites may or may not perform well compared to a local network of data sites.
Here we present our experiences with the multiple-model learning approach using Papyrus. We have experimented with the C4.5 (Quinlan, 1993) decision tree classifier. The data-set we have used is the healthcare data obtained from the University Health Consortium (UHC), which is naturally distributed on zip-code information. Data sites at the Laboratory for Advanced Computing (LAC) at UIC, at the University of Maryland and at the University of Pennsylvania, connected via the very high speed Backbone Network Service (vBNS), were used.
Our intention is to focus on the speed and accuracy issues of such a system. The primary goal of the experiments is to present empirical results comparing our approach with the alternative of moving the data to the query. We also present the time complexity trends of building ensembles of models over the LAN and the WAN. The statistical analysis of the error rates and the communication issues will be the focus of future work.
5.2 Experiments and Results
This section describes the experimental setup used for the speed and accuracy tests.
5.2.1 Goals
The experiments conducted focused on the speed and accuracy aspects of distributed learning. We intended to provide empirical results comparing the approach of moving the data to the query vs. the approach of moving the models, in the case of large, inherently distributed data-sets. In particular we had the following goals:
Speed
1) Analyze the time complexity of building ensembles of models over a range of ensemble sizes.
2) Compare the time complexity of building ensembles of models over the local area network to that over the wide area networks.
Accuracy
Present the prediction error rates for the ensembles of models and compare them with the single model approach rates.
5.2.2 Setup
The experiments were conducted for training datasets ranging from 1,000 examples (0.26 MB) to 200,000 examples (50.10 MB), distributed across 1 to 21 data sites, both over the local network and over the wide area network (vBNS). The experiments were performed both by moving the query to the data and by moving the data to the query.
5.2.3 Time and Accuracy Results
Figure 5 plots the prediction error percentage vs. the number of examples in the training dataset, for numbers of models ranging from 1, 3, ... to 21. The corresponding data is in Table I.
Figure 6 plots the time complexity (in seconds) vs. the number of examples in the training dataset, for numbers of models ranging from 1, 3, 5, ... to 21, for the local area tests. The corresponding data is in Table II.
Figure 7 plots the time complexity (in seconds) vs. the number of examples in the training dataset, for numbers of models ranging from 3, 5, ... to 21, for the wide area tests. The corresponding data is in Table III.
Figure 8 plots the time complexity (in seconds) vs. the number of examples in the training dataset, for the data transfer and model transfer approaches over 5 data sites. The corresponding data is in Table IV.
Table I: Prediction Error Trends for Models built on LAN and WAN
#Records | 1-Model | 3-Models | 5-Models | 7-Models | 9-Models | 15-Models | 21-Models
1000 | 14.4 | 14 | 13.6 | 13.6 | 15.2 | 14.8 | 14
5000 | 14.9 | 14.96 | 15.36 | 15.36 | 15.12 | 15.6 | 15.2
10000 | 12.7 | 13.44 | 12.92 | 14.32 | 13.92 | 14.24 | 14.16
50000 | 13.8 | 11.984 | 13 | 12.912 | 13.072 | 13.072 | 13.126
100000 | 12.8 | 12.812 | 12.912 | 12.844 | 12.852 | 12.972 | 13.128
200000 | - | 12.37 | 12.16 | 12.284 | 12.226 | 12.228 | 12.198
Columns 2 through 8 show the percentage error in prediction.
Figure 5: Prediction Error Trends for Models built on LAN and WAN
Table II: Time Complexity Trends Using Local Cluster at UIC
#Records | 1-Model | 3-Models | 5-Models | 7-Models | Bench Mark
1000 | 5.05 | 9.7 | 14.56 | 20.64 | 2.26
5000 | 10.31 | 12.12 | 17.6 | 23.98 | 9.25
10000 | 18.46 | 15.5 | 20.91 | 27.98 | 18.62
50000 | 154.89 | 60.64 | 86.78 | 71.47 | 156.81
100000 | 437.47 | 131.04 | 114.03 | 137.48 | 424.14
200000 | - | - | - | - | -
Columns 2 through 6 show the time taken (in seconds) to build the models and combine the results.
Figure 6: Time Complexity Trends Using Local Cluster at UIC
Table III: Time Complexity Trends Using Clusters at UIC, UPENN and UMD
#Records | 3 Models | 5 Models | 7 Models | 9 Models | 15 Models | 21 Models
1000 | 10.56 | 17.4 | 22 | 28.27 | 47.47 | 62.44
5000 | 14.06 | 21.1 | 26.4 | 33.01 | 53.18 | 77.94
10000 | 19.6 | 27.28 | 32.85 | 40.01 | 63.41 | 88.17
50000 | 82.49 | 79.56 | 86.92 | 155.49 | 156.86 | 200.89
100000 | 185.01 | 161.24 | 139.54 | 183.89 | 255.41 | 362.65
200000 | 410.11 | 330.52 | 335.14 | 294.42 | 412.61 | 675.6
Columns 2 through 7 show the time taken (in seconds) to build the models and combine the results.
Figure 7: Time Complexity Trends Using Clusters at UIC, UPENN and UMD
Table IV: Time Complexity - Data Transfer vs. Model Transfer (over 5 locations)
# Records | LAN DT Time | WAN DT Time | C45 Execution Time | LAN DT & C45 | WAN DT & C45 | LAN MT & C45 | WAN MT & C45
5000 | 0.6 | 1.661 | 9.25 | 9.85 | 10.911 | 17.6 | 21.1
10000 | 0.96 | 2.885 | 18.62 | 19.58 | 21.505 | 20.91 | 27.28
50000 | 3 | 11.96 | 156.81 | 159.81 | 168.77 | 86.78 | 79.56
100000 | 4.3 | 21.75 | 424.14 | 428.44 | 445.89 | 114.03 | 161.24
DT : Data Transfer
MT : Model Transfer
Columns 2 through 8 indicate the time in seconds.
Figure 8: Time Complexity - Data Transfer vs. Model Transfer (over 5 locations)
5.3 Initial Observations
Using our infrastructure (128 MB RAM), C4.5 could not handle 150,000 examples with the single model approach, but we could run on more than 200,000 examples using the multiple model approach.
5.3.1 Time Complexity
Empirical results show that, beyond a break-even point, the time complexity of building ensembles of decision tree classifiers is less than that of the single model approach. This break-even point is a function of the number of models in the ensemble. For example, in Table II a single model on 1,000 records takes 5.05 seconds versus 9.7 seconds for 3 models, while on 100,000 records the single model takes 437.47 seconds versus 131.04 seconds for 3 models. Also, the ensemble times grow more nearly linearly with the number of training examples.
5.3.2 Prediction Error Rates
The error rates of the ensembles are lower for small numbers of training examples but fare more poorly for large numbers of training examples. However, there seems to be an ensemble size (in the 5 to 7 model range) which consistently performs better than a single model.
5.4 Initial Conclusions
An early conclusion is that there is a trade-off involved in using multiple-model learning for inherently distributed, and hence disjoint, datasets. If time complexity is a major concern in predictive modeling, then multiple-model learning is decidedly better than a single model approach. But, for prediction error rates, multiple-model learning is good only for a range of ensemble sizes and not for every ensemble.
6. Conclusions and Future Work
This thesis report introduced the architecture of Papyrus, a distributed, scalable and extensible system that supports the launching of learning agents to distributed sites over wide area networks.
The application design concepts embodied by the framework of Papyrus, i.e., the meta-model approach (the PMML based Model Interchange Format), the Resource Information Format for resource management, and the meta-cluster approach, provide an important step in developing systems that learn from massive data-sets in distributed and collaborative data mining environments.
The performance results of Papyrus on inherently distributed data indicate that it is a suitable workbench for experiments on multiple-model learning. The speed and accuracy trends of these experiments were tabulated.
While the current implementation of Papyrus uses SGML based MIFs (described by PMML 0.8 and C++ libraries), work is in progress on the next version, which uses XML (described by PMML 0.9 and Java libraries). Specifications for more learning algorithms are also planned for incorporation in the next release. The software agent system currently used is a Tcl system; Java agents, if developed, would provide more portability.
The performance results provide scope for future research, e.g., statistical analysis of the prediction error trends. The communication issues involved in multiple-model learning (like network bandwidth usage) will also be investigated in the future.
APPENDIX A: File Format for Single Predictive Models
The Structure of PMML documents
PMML 0.8 Documents start with a <!DOCTYPE> declaration followed by a MIF element containing a HEADER and then a MODEL element:
<!DOCTYPE PMML PUBLIC "-//LAC//DTD PMML 0.8 ">
<MIF>
<HEADER>
<DATA-SCHEMA>
... other head elements
</HEADER>
<MODEL>
...elements specific to a model
</MODEL>
</MIF>
In practice, the MIF, HEADER and MODEL end tags can be omitted from the markup as these can be inferred in all cases by parsers conforming to the PMML 0.8 DTD.
Every conforming PMML 0.8 document must start with the <!DOCTYPE> declaration that is needed to distinguish PMML 0.8 documents from other versions of PMML. Every PMML 0.8 document must include DATA-SCHEMA in the HEADER block and a model specific part like CART-MODEL, ID3-MODEL, etc. in the MODEL block. A minimal PMML 0.8 document thus looks like:
<!DOCTYPE PMML PUBLIC "-//LAC//DTD PMML 0.8 ">
<MIF>
<HEADER>
<DATA-SCHEMA>
... other head elements
<MODEL>
...elements specific to a model
The HEADER element
<!ELEMENT HEADER - O (DATA-SCHEMA & CREATION-INFORMATION?) >
This contains the document header; the end tag for HEADER can always be omitted. The contents of the document header are an unordered collection of the following elements:
<!ELEMENT DATA-SCHEMA - -
(ATTRIBUTE-DESCRIPTOR, ATTRIBUTE-DESCRIPTOR+ ) >
Every PMML 0.8 document must have exactly one DATA-SCHEMA element in the document's HEADER. It provides the data-schema modeled by the given MIF file. It must contain at least two attribute descriptors, one being the predicted attribute.
<!ELEMENT ATTRIBUTE-DESCRIPTOR - O ( mapping-function? ) >
<!ATTLIST ATTRIBUTE-DESCRIPTOR
NAME CDATA #REQUIRED
USE-AS (exclude | continuous | category | binary-category) #REQUIRED
DATA-TYPE ( real | integer | boolean | string ) #REQUIRED
>
It describes a single attribute of the data-schema. It can contain at most one mapping function. NAME specifies the name of the attribute, USE-AS specifies the usage of this attribute in the data mining process, and DATA-TYPE specifies the way the attribute is stored in the database.
<!ELEMENT MAPPING-FUNCTION - - CDATA >
<!ATTLIST MAPPING-FUNCTION TYPE CDATA #REQUIRED>
The mapping function describes the transformation to be performed on the attribute. The TYPE attribute indicates the language in which the mapping function is written.
<!ELEMENT CREATION-INFORMATION - -
( COPYRIGHT? & APPLICATION? & INDIVIDUAL? & TIMESTAMP? ) >
This is the information about how, when and by whom the model was created. It is optional and the tags in this sub tree are self-explanatory.
The MODEL element
<!ELEMENT MODEL - O
( CREATION-INFORMATION?, (CART-MODEL | REGRESSION-MODEL | ID3-MODEL ) ) >
<!ATTLIST MODEL
NAME CDATA #REQUIRED
TYPE (CART | C4.5 | OC-1) #REQUIRED
TRAINING-DATA-NAME CDATA #IMPLIED
TRAINING-DATA-SIZE NUMBER #IMPLIED
>
This contains the model specific part of the document. The end tag for MODEL may be omitted. The key attributes are: model name and the model type.
This field is the same as that described for the HEADER block.
This block describes the details of a particular type of predictive model. Here we present the MIF elements for the models we support.
a) C4.5
<!ELEMENT C45-MODEL - - ( (C45-NODE | C45-LEAF-NODE)+ ) >
<!ATTLIST C45-MODEL
...
>
b) CART
<!ELEMENT CART-NODE - O EMPTY >
<!ATTLIST CART-NODE
... >
<!ELEMENT CART-LEAF-NODE - O EMPTY>
<!ATTLIST CART-LEAF-NODE
…
>
<!ELEMENT CART-MODEL - - ( (CART-NODE | CART-LEAF-NODE)+ ) >
<!ATTLIST CART-MODEL
TYPE ( binary-classification | classification | regression ) #REQUIRED
ATTRIBUTE-PREDICTED CDATA #REQUIRED
NUMBER-NODES NUMBER #REQUIRED
DEPTH NUMBER #REQUIRED
>
It marks the beginning of a cart-model. The attributes of this tag include the TYPE of the CART model, the attribute that is predicted using this model, the number of nodes in the tree and the depth of the tree.
<!ELEMENT CART-NODE - O EMPTY >
<!ATTLIST CART-NODE
NODE-NUMBER NUMBER #REQUIRED
ATTRIBUTE-NAME CDATA #REQUIRED
LEFT-CHILD NUMBER #REQUIRED
RIGHT-CHILD NUMBER #REQUIRED
CUT-VALUE CDATA #REQUIRED
>
This denotes a non-leaf node in the tree. Its attributes are the node number, the attribute name associated with the node, the node numbers of its left and right children and the cut value.
<!ELEMENT CART-LEAF-NODE - O EMPTY>
<!ATTLIST CART-LEAF-NODE
NODE-NUMBER NUMBER #REQUIRED
SCORE CDATA #REQUIRED
>
This denotes a leaf node in the tree. Its attributes are the node number and the class value associated with it.
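To make the intent of these elements concrete, the sketch below shows one way a consumer of a MIF document might score a record against the flattened CART node list after parsing. It is an illustration only: the convention that values less than or equal to the cut value go to the left child, the use of node number 0 as the root, and the representation of record values as numbers are all assumptions of this sketch, not part of the format.

// Illustrative sketch of scoring a record with a CART model read from a MIF
// document. 'nodes' is indexed by NODE-NUMBER; 'record' maps attribute names
// (as in ATTRIBUTE-NAME) to numeric values.
#include <map>
#include <string>
#include <vector>

struct CartNode {
    bool isLeaf = false;
    std::string attributeName;  // ATTRIBUTE-NAME (non-leaf only)
    double cutValue = 0.0;      // CUT-VALUE (non-leaf only)
    int leftChild = -1;         // LEFT-CHILD node number
    int rightChild = -1;        // RIGHT-CHILD node number
    std::string score;          // SCORE, i.e. the class value (leaf only)
};

std::string scoreRecord(const std::vector<CartNode> &nodes,
                        const std::map<std::string, double> &record)
{
    int current = 0;  // assumption: the root is node number 0
    while (!nodes[current].isLeaf) {
        double value = record.at(nodes[current].attributeName);
        current = (value <= nodes[current].cutValue)
                      ? nodes[current].leftChild
                      : nodes[current].rightChild;
    }
    return nodes[current].score;
}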
c) ID3
<!ELEMENT ID3-MODEL - - ( (ID3-NODE | ID3-LEAF-NODE)+ ) >
<!ATTLIST ID3-MODEL
ATTRIBUTE-PREDICTED CDATA #REQUIRED
NUMBER-NODES NUMBER #REQUIRED
DEPTH NUMBER #REQUIRED
>
It marks the beginning of an id3-model. The attributes of this tag include the attribute that is predicted using this model, the number of nodes in the tree and the depth of the tree.
<!ELEMENT ID3-NODE - O EMPTY >
<!ATTLIST ID3-NODE
NODE-NUMBER NUMBER #REQUIRED
ATTRIBUTE-NAME CDATA #REQUIRED
CUT-VALUE CDATA #REQUIRED
LEFT-CHILD NUMBER #REQUIRED
RIGHT-SIBLING NUMBER #REQUIRED
>
This denotes a non-leaf node in the tree. Its attributes are the node number, the attribute name associated with the node, the node numbers of its left child and right sibling and the cut value.
<!ELEMENT ID3-LEAF-NODE - O EMPTY>
<!ATTLIST ID3-LEAF-NODE
NODE-NUMBER NUMBER #REQUIRED
CUT-VALUE CDATA #REQUIRED
SCORE CDATA #REQUIRED
RIGHT-SIBLING NUMBER #REQUIRED
>
This denotes a leaf node in the tree. Its attributes are the node number, the cut value, the class value, and the node number of its right sibling.
d) LINEAR REGRESSION
<!ELEMENT LINEAR-REGRESSION-MODEL - O ( LINEAR-REGRESSION-COEFFICIENT )+ >
<!ATTLIST LINEAR-REGRESSION-MODEL
DIMENSION CDATA #REQUIRED
>
It marks the beginning of a linear regression model. The attribute of this tag is the dimension of the model.
<!ELEMENT LINEAR-REGRESSION-COEFFICIENT - - EMPTY>
<!ATTLIST LINEAR-REGRESSION-COEFFICIENT
COEFFICIENT-POSITION CDATA #REQUIRED
COEFFICIENT-VALUE CDATA #REQUIRED
>
This tag gives the position and the coefficient value for that position in the linear regression model.
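As an illustration, evaluating a transported linear regression model amounts to a dot product of the coefficient vector with the input vector. The sketch below assumes the coefficients have already been read out of the MIF document in COEFFICIENT-POSITION order; how an intercept term is represented is left to the model producer and is not specified here.

// Illustrative sketch: predict with a linear regression model whose
// coefficients were extracted from LINEAR-REGRESSION-COEFFICIENT elements.
#include <cstddef>
#include <vector>

double evaluateLinearRegression(const std::vector<double> &coefficients,
                                const std::vector<double> &input)
{
    double prediction = 0.0;
    for (std::size_t i = 0; i < coefficients.size() && i < input.size(); ++i)
        prediction += coefficients[i] * input[i];
    return prediction;
}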
An example PMML 0.8 document describing a single CART model follows:
<!DOCTYPE PMML PUBLIC "-//LAC//DTD PMML 0.8 ">
<mif>
<header>
<creation-information>
<copyright>
Copyright (C) 1997 Magnify Research. All Rights Reserved.
</copyright>
<application name='PATTERN:Detect' version='1.3alpha'>
<organization name='Magnify Research' url='www.magnify.com'>
<postal-address street1='100 South Wacker Drive' street2='Suite 1130' city='Chicago' state='Il' country='US' zipcode='60606-4003'>
</organization>
</application>
</creation-information>
<data-schema>
<attribute-descriptor name='card number' use-as='exclude' data-type='string'>
<attribute-descriptor name='timestamp' use-as='exclude' data-type='string'>
<attribute-descriptor name='dollar amount' use-as='continuous' data-type='real'>
<attribute-descriptor name='issuer' use-as='category' data-type='integer'>
<attribute-descriptor name='acquirer' use-as='category' data-type='integer'>
<attribute-descriptor name='sic code' use-as='category' data-type='integer'>
<attribute-descriptor name='fraud indicator' use-as='boolean' data-type='binary'>
<mapping-function type='perl'>
sub Char_Match_indicator
{
# the input will always be a scalar found in @_[0]
$input = @_[0];
# for debugging purposes, print out the input value
# printf ( "input is $input\n" );
# scrub char value into a boolean
if ( $input eq "G" )
{
$output = 0;
}
elsif ( $input eq "X" )
{
die "unconsumated record found in database\n";
}
else
{
$output = 1;
}
# for debugging purposes, print out the output value
# printf ( "output is $output\n" );
return $output;
}
</mapping-function>
<attribute-descriptor name='derived attribute 0' use-as='continuous' data-type='real'>
<attribute-descriptor name='derived attribute 1' use-as='continuous' data-type='real'>
<attribute-descriptor name='derived attribute 2' use-as='continuous' data-type='real'>
</data-schema>
<model-weights>
<weight model-name='simple tree' weight='1'>
</model-weights>
</header>
<model name='simple tree' training-data-name='./data/train' training-data-size='50000'>
<cart-model type='binary-classification' attribute-predicted='Fraud Indicator' number-nodes='13' depth='3'>
<cart-node node-number='0' attribute-name='derived attribute 3' cut-value='-0.158746' left-child='1' right-child='6'>
<cart-node node-number='1' attribute-name='derived attribute 65' cut-value='-0.273673' left-child='2' right-child='3'>
<cart-leaf-node node-number='2' score='0'>
<cart-node node-number='3' attribute-name='derived attribute 19' cut-value='-0.641615' left-child='4' right-child='5'>
<cart-leaf-node node-number='4' score='0'>
<cart-leaf-node node-number='5' score='1'>
<cart-node node-number='6' attribute-name='derived attribute 17' cut-value='-0.082279' left-child='7' right-child='10'>
<cart-node node-number='7' attribute-name='sic code' cut-value='5967' left-child='8' right-child='9'>
<cart-leaf-node node-number='8' score='1'>
<cart-leaf-node node-number='9' score='0'>
<cart-node node-number='10' attribute-name='derived attribute 27' cut-value='0.583846' left-child='11' right-child='12'>
<cart-leaf-node node-number='11' score='0'>
<cart-leaf-node node-number='12' score='1'>
</cart-model>
</model>
</mif>
APPENDIX B: File Format for Ensemble of Models
In this Appendix we propose extensions to the PMML described above so that an ensemble of models built on the same data schema can be incorporated in a single PMML document. The only modification occurs in the HEADER block, where a tag called MODEL-WEIGHTS is introduced. The new HEADER block looks like
<!ELEMENT HEADER - O
(DATA-SCHEMA & CREATION-INFORMATION? & MODEL-WEIGHTS?) >
<!ELEMENT MODEL-WEIGHTS - - ( WEIGHT + ) >
Model weighting information must contain one WEIGHT element per model in the document. If the document contains only one model, this element may be omitted.
<!ELEMENT WEIGHT - O EMPTY >
<!ATTLIST WEIGHT
MODEL-NAME CDATA #REQUIRED
WEIGHT-VALUE CDATA #REQUIRED >
The WEIGHT element gives the weight assigned to the named model.
The MODEL block now contains one or more models, each conforming to its structure as defined in the Document Type Definition, for example:
<MODEL>
<CART-MODEL>
information pertinent to cart-model
</CART-MODEL>
…
<LINEAR-REGRESSION-MODEL>
information pertinent to linear regression
</LINEAR-REGRESSION-MODEL>
</MODEL>
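For a document carrying both of the models sketched above, the header's MODEL-WEIGHTS block would contain one WEIGHT per model. The model names and weight values below are purely illustrative:
<MODEL-WEIGHTS>
<WEIGHT MODEL-NAME='simple tree' WEIGHT-VALUE='0.6'>
<WEIGHT MODEL-NAME='regression 1' WEIGHT-VALUE='0.4'>
</MODEL-WEIGHTS>
The markup itself only records the weights; how they are combined across models (for example, by weighted voting or weighted averaging of model scores) is not fixed by this extension.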
APPENDIX C: File Format for Resource Description
The Structure of RIF Documents
A RIF document starts with a <!DOCTYPE> declaration followed by a RIF element containing one or more CLUSTER elements:
<!DOCTYPE RIF PUBLIC "-//LAC//DTD RIF 0.8">
<RIF>
<CLUSTER>
<DOMAIN>
… other cluster elements
</CLUSTER>
…
</RIF>
In practice, the RIF tag can be omitted from the markup. A RIF document should have at least one CLUSTER element and each CLUSTER element should have at least one DOMAIN element.
CLUSTER element
<!ELEMENT CLUSTER - O (DOMAIN+, DM-ENGINE*) >
<!ATTLIST CLUSTER
NAME CDATA #REQUIRED
POINT-OF-ENTRY CDATA #REQUIRED >
The CLUSTER element represents one cluster and contains at least one DOMAIN element and zero or more data mining engines present at that cluster, represented by DM-ENGINE elements. The NAME attribute specifies the global name of the cluster and POINT-OF-ENTRY specifies the cluster point of entry.
DOMAIN element
<!ELEMENT DOMAIN - O (DATASET+) >
<!ATTLIST DOMAIN
NAME CDATA #REQUIRED >
This represents the domain of application, such as medical or physics data, as given by the NAME attribute. Each domain consists of one or more DATASET elements.
<!ELEMENT DATASET - O ((TABLE, MODEL*)+) >
<!ATTLIST DATASET
NAME CDATA #REQUIRED >
Each dataset consists of one or more tables and the associated predictive models. The NAME attribute specifies the global name of the dataset.
<!ELEMENT TABLE - O (SCHEMA) >
<!ATTLIST TABLE
NAME CDATA #REQUIRED >
The TABLE element contains the schema for the table. The NAME attribute specifies the name of the table.
<!ELEMENT SCHEMA - O (ATTRIBUTE-DESCRIPTOR, ATTRIBUTE-DESCRIPTOR+) >
The data schema of a table is denoted using a set of attribute descriptors. There should be at least two attributes in each table for any data discovery purposes.
<!ELEMENT ATTRIBUTE-DESCRIPTOR - O ( MAPPING-FUNCTION? ) >
<!ATTLIST ATTRIBUTE-DESCRIPTOR
NAME CDATA #REQUIRED
USE-AS (exclude | continuous | category | binary-category) #REQUIRED
DATA-TYPE ( real | integer | boolean | string ) #REQUIRED
>
It describes a single attribute of the data schema and can contain at most one mapping function. NAME specifies the name of the attribute, USE-AS specifies how the attribute is used in the data mining process and DATA-TYPE specifies the way the attribute is stored in the database.
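For example, an attribute that needs scrubbing before use could carry its mapping function inline; the attribute name here is illustrative, and the TYPE attribute follows the usage in the sample MIF document above rather than anything declared in this fragment of the DTD:
<ATTRIBUTE-DESCRIPTOR NAME='fraud indicator' USE-AS='binary-category' DATA-TYPE='string'>
<MAPPING-FUNCTION TYPE='perl'>
... a scrubbing routine such as the Char_Match_indicator function shown earlier ...
</MAPPING-FUNCTION>
</ATTRIBUTE-DESCRIPTOR>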
<!ELEMENT MODEL - O (#PCDATA) >
<!ATTLIST MODEL
NAME CDATA #REQUIRED
TYPE CDATA #REQUIRED >
This represents a predictive model built on the associated data schema. The NAME attribute specifies the global name of the model and TYPE specifies the type of the predictive model, e.g., C4.5 or CART for decision trees.
DM-ENGINE element
<!ELEMENT DM-ENGINE - O (HEADER,MODEL-TYPE+) >
DM-ENGINE represents the data discovery tools present at a cluster. Each data mining engine consists of some HEADER information and one or more MODEL-TYPE elements describing the nature of its discovery tools; a DM-ENGINE can therefore offer more than one type of data discovery tool.
<!ELEMENT HEADER - O (#PCDATA) >
<!ATTLIST HEADER
NAME CDATA #IMPLIED
VERSION CDATA #IMPLIED
COPYRIGHT CDATA #IMPLIED
TIMESTAMP CDATA #IMPLIED
>
It contains optional information about the data mining engine, such as the name of the implementor, the version number, copyright details and a timestamp.
<!ELEMENT MODEL-TYPE - O (#PCDATA) >
MODEL-TYPE represents the name of a data discovery tool like C4.5 or CART.
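Putting these elements together, a small RIF document describing a single cluster might look like the following. The cluster name, point of entry, dataset, attribute names, model and engine details are all invented for illustration:
<!DOCTYPE RIF PUBLIC "-//LAC//DTD RIF 0.8">
<RIF>
<CLUSTER NAME='cluster-a' POINT-OF-ENTRY='host-a.example.edu'>
<DOMAIN NAME='Medical'>
<DATASET NAME='patient-records'>
<TABLE NAME='admissions'>
<SCHEMA>
<ATTRIBUTE-DESCRIPTOR NAME='length of stay' USE-AS='continuous' DATA-TYPE='integer'>
<ATTRIBUTE-DESCRIPTOR NAME='outcome' USE-AS='category' DATA-TYPE='integer'>
</SCHEMA>
</TABLE>
<MODEL NAME='admissions-tree' TYPE='CART'>
</DATASET>
</DOMAIN>
<DM-ENGINE>
<HEADER NAME='Papyrus' VERSION='0.8'>
<MODEL-TYPE>CART</MODEL-TYPE>
</DM-ENGINE>
</CLUSTER>
</RIF>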
REFERENCES
ANSI X3H4 (1985). American National Standard Information Resource Dictionary System: Part 1-Core Standard. ANSI X3H4, American National Standards Institute, New York, 1985.
Ashok R. (1998). Incorporating transportable software agents into a wide area high performance distributed data mining system. Thesis report, Department of Electrical Engineering and Computer Science, University of Illinois at Chicago, 1998.
Breiman L. (1996). Bagging Predictors. Machine Learning, Volume 24, Number 2, pages 123-140.
Breiman L., Friedman J., Olshen R. and Stone C. (1984). CART: Classification and Regression Trees. Belmont, California, Wadsworth.
BNIF (1997). Proposal for a Bayesian Network Interchange Format. Decision Theory Group, Microsoft Research, 1996.
Buneman P., Davidson S., Grossman R. and Maier D. (1995). Interoperating with Non-Database Data. University of Pennsylvania, Computer Science Department Technical Report, 1995.
Chan P. and Stolfo S. (1995). A Comparative Evaluation of Voting and Meta-Learning on Partitioned Data. Proceedings of the Twelfth International Conference on Machine Learning, pages 90-98.
Chen M.S., Han J. and Yu P.S. (1996). Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Volume 8, Number 6, 1996.
Crawley S., Davis S., Indulska J., McBride S., and Raymond K. (1997). Meta Information Management. Formal Methods for Open Object-based Distributed Systems (FMOODS) Conference, Canterbury, UK, July 1997.
Domingos P. (1997). Why Does Bagging Work? A Bayesian Account and its Implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 155-158.
Drucker H. (1997). Fast Committee Machines for Regression and Classification. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 159-162.
Gray J. (1996). Data Management: Past, Present, and Future. IEEE Computer’s 50th Anniversary issue.
Gray R., Rus D. and Kotz D. (1996). Transportable Information Agents. Department of Computer Science, Dartmouth College, Technical Report, 1996.
Gray R. (1996). Agent Tcl. PhD Thesis, Department of Computer Science, Dartmouth College, 1996.
Grossman R.L., Hanley D. and Qin X. (1994). Ptool: A Light Weight Persistent Object Manager. Proceedings of SIGMOD 94, ACM 1994, pp 510.
Grossman R.L., Bodek H., Northcutt and Poor V. (1996). Data Mining and Tree-based Optimization. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Simoudis E., Han J. and Fayyad U., editors, AAAI Press, Menlo Park, 1996, pp 323-326.
Grossman R.L. (1997a). Class lectures for MISI Workshop. University of Illinois at Chicago, Spring 1997.
Grossman R.L. (1997b). Model Interchange Format for Predictive Models. Magnify Press Release, July 1997.
Grossman R.L. (1998). Data Mining: Two Cultures. Magnify Technical Report 98-1, submitted for publication.
Grossman R.L. and Cubed M. (1998). A Framework for Integrating Data Management, Data Mining and Predictive Modeling within the Knowledge Discovery Process. Magnify Technical Report 98-2, submitted for publication.
Grossman R.L., Bailey S., Pulleyn I., Hallstrom P., Qin X., Ramu A. and Malhi B. (1998a). The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML). Submitted for publication.
Hsu C., Bouziane M., Rattner L. and Yee L. (1991). Information Resources Management in Heterogeneous, Distributed Environments: A Metadatabase Approach. IEEE Transactions on Software Engineering, 1991.
Kargupta H., Hamzaoglu I. and Stafford B. (1997). Scalable Distributed Data Mining – An Agent Architecture. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 211-214.
Jaap J.P. and Davis E.K. (1994). A Format for the Interchange of Scheduling Models. 3rd International Symposium on Artificial Intelligence, Robotics, and Automation for Space, Jet Propulsion Laboratory, Pasadena, 1994.
Maxwell T. (1997). A Meta-Model Approach to Modular Simulation. Technical Report, Institute for Ecological Economics, University of Maryland, 1997.
MDC (1997). The Meta Data Interchange Specification Initiative: Goals and Charter. Meta Data Coalition, http://www.he.net/~metadata, June 1997.
Mon D. (1997). Data Warehouses, Data Marts and Data Mining. Class Lectures for HIS course, University of Illinois at Chicago, Fall 1997.
Ousterhout J.K. (1994). Tcl and the Tk Toolkit. Addison-Wesley, Reading, Massachusetts, 1994.
Parsaye K. (1996). DataMines for DataWarehouses. Database Programming and Design, September 1996.
Parsaye K. (1997a). OLAP and Data Mining: Bridging the Gap. Database Programming and Design, February 1997.
Parsaye K. (1997b). A Characterization of Data Mining Technologies and Processes. White Paper, Information Discovery, Inc., 1997.
PMML (1998). The Predictive Model Markup Language Version 0.9, The Data Mining Group.
PowerSim (1997). The PowerSim Corp. page. http://www.powersim.no., 1997.
Provost F. (1997). Scaling Up Inductive Algorithms: An Overview. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 239-242.
Puller D., and Stock K. (1997). Geospatial Modeling: A Case Study for a Statewide Land Information Strategy. Technical Report, The University of Queensland, 1997.
Quinlan J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc.
Raftery A.E., Madigan D. and Hoeting J.A. (1996). Bayesian Model Averaging for Linear Regression Models. Submitted for publication.
Ramasamy K. (1998). Object Relational Data Warehouse. Masters Project report, Laboratory for Advanced Computing, University of Illinois at Chicago, 1998.
SGML (1986). Information Processing – Text and Office Systems – Standard Generalized Markup Language (SGML). ISO 8879:1986.
Shekhar S., Coyle M., Goyal B., Liu D. and Sarkar S. (1997). Data Models in Geographic Information Systems. Communications of the ACM 40(4), 103-111.
Stolfo S., Prodromidis A., Tselepis S., Lee W. and Fan D. (1997). JAM: Java Agents for Meta-Learning over Distributed Databases. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 74-81.
Tanenbaum A. (1996). Computer Networks, Third Edition. Prentice Hall Inc., New Jersey, 1996.
Tukey J. (1973). Exploratory Data Analysis. New York: McMillan, 1973.
W3 XML (1997). Working Draft Specification for XML. http://www.w3.org/pub/WWW/TR/WD-xml-lang-970331.html.
Wolpert D. (1992). Stacked Generalization. Neural Networks, Volume 5, pages 241-259.