In this thesis report, a framework for predictive modeling in distributed and collaborative data mining environments is detailed. The framework uses meta-model and meta-cluster approaches and is a suitable test-bed for experiments on ensemble learning, partitioned learning and meta-learning. Two components of the framework, namely Model Interchange Format (MIF) formalism for exchanging predictive models and Resource Information Format (RIF) formalism for management of resources are described. The performance analysis of a system implementing this framework is presented as well.
This framework has been developed to meet three pressing needs: (1) to support distributing a data mining task among different data sites, a scenario that arises because of the inherent distribution of data in applications or because the current generation of learning algorithms cannot handle very large data-sets; (2) to allow for collaborative data mining, i.e., the inter-operability of data mining systems; and (3) to support benchmarking of models and techniques.
The model interchange format is the meta-model used to describe predictive models; it facilitates the migration and easy ingestion of these models at geographically remote sites. The meta-cluster approach tackles the problem of distributed and large data-sets by providing an infrastructure for the management and transparent communication of resources between sites.
The guiding principles and the specifications of an SGML/XML based model interchange format and resource information format are presented. These specifications are the result of meetings between the representatives of member organizations of the Data Mining Group (DMG).
Papyrus, a distributed and scalable data mining system that employs general meta-model and meta-cluster approaches, has been implemented as a proof of concept of the framework. Papyrus provides a set of learning programs, implemented as C++ applications, which can compute predictive models over data stored locally at several distributed clusters/sites. Papyrus also provides a set of software agents and directory services for combining multiple models that were learned (perhaps) at different sites.
The overall architecture of the Papyrus system and the specific implementation currently under development at the Laboratory for Advanced Computing (LAC), University of Illinois at Chicago (UIC), are presented. The experimental setup and results of performance tests for speed and accuracy analysis using Papyrus for inherently distributed data are presented.
One of Papyrus's target applications uses healthcare data obtained from the University Health Consortium (UHC). This application was one of LAC's demonstrations at the Internet 2 conference at Washington DC (Oct. 7-9, 1997) and also at Supercomputing 97 (Nov. 16-21, 1997) in San Jose, CA. Interested readers can learn more about this project at http://www.lac.uic.edu/~balinder/medical/sc97/top.html.
I would like to thank my thesis advisor, Dr. Robert Grossman, for introducing me to data mining and for all his guidance and support. This thesis work was enabled and sustained by his vision and ideas.
Stuart Bailey is due a special note of thanks for his unique way of inspiring students through clarity of thought, enthusiasm and caring. He is a model to all laboratory students. His technical excellence, unwavering faith and constant encouragement were very helpful and made this effort an enjoyable one.
I wish to thank Dr. Donald Mon for providing us with the healthcare data and introducing us to issues in healthcare management.
I wish to thank Dr. Simon Kasif for his guidance and comments on the experiments.
I am grateful to Shirley Connelly and David Turkington for proofreading this report. Their comments and suggestions were very valuable as the thesis progressed.
A number of individuals in the Laboratory for Advanced Computing were very helpful and understanding. I would like to acknowledge Ashok Ramu, Marco Mazzucco and Srinath Gutti in particular for their contribution to this work.
1. Introduction
1.1 Predictive Modeling Culture in Data Mining
1.2 Current Challenges in Predictive Modeling
1.3 A historical and philosophical perspective
1.4 Scope of this work
2. PMML: A Meta-Model Formalism for Predictive Models
2.1 What is a Meta-Model Approach?
2.2 Approaches to Model Specification
2.3 Related Modeling efforts
2.4 Design Goals for Model Interchange Format
2.5 SGML/XML based formalism for Model Interchange Format
2.6 Advantages of using SGML/XML based formalism
2.7 File format for Single Predictive Models
2.8 File format for Ensemble of models
3. Resource Information Management in heterogeneous, distributed data mining environments
3.1 Characteristics of the Underlying Architecture
3.2 Unified Global Resource Management Model
3.3 Role of Meta-data in Emerging Systems
3.4 Logical Components of Metadata
3.5 Design Goals for Resource Information Format
4. Papyrus: Philosophy, Design and Architecture
4.1 Meta-Cluster Approach
4.2 Papyrus Architecture
4.3 Major Components of Papyrus
4.4 Advantages of Papyrus Architecture
5. Performance Analysis of the Multiple-Model Learning Approach for Inherently Distributed Data-Sets using Papyrus
5.1 Inherently distributed data
5.2 Experiments and Results
5.2.1 Goals
5.2.2 Setup
5.2.3 Time and Accuracy Results
5.3 Initial Observations
5.3.1 Time Complexity
5.3.2 Prediction Error Rates
5.4 Initial Conclusions
6. Conclusions and Future Work
APPENDIX A: File Format for Single Predictive Models
APPENDIX B: File Format for Ensemble of Models
APPENDIX C: File Format for Resource description
CITED LITERATURE
1. Introduction
This chapter introduces the predictive modeling culture in data mining and addresses the research challenges of predictive modeling in distributed and collaborative data mining environments. It also provides a brief historical and philosophical perspective on the evolution of these challenges and concludes with the scope and motivation of this work.
1.1 Predictive Modeling Culture in Data Mining
Formally, data mining can be defined as the process of automatic extraction of patterns, associations, changes and anomalies in large data sets (Grossman, 1997a, unpublished data). This process involves some search, which may be done by the user or may be assisted by a smart program that automatically searches the database for significant patterns. The issues raised in some data mining problems over large data-sets can be equally relevant for small data-sets (Grossman, 1997a, unpublished data).
Traditionally there have been two types of statistical analyses: Confirmatory Analysis and Exploratory Analysis (Parsaye, 1997a). In confirmatory analysis, also called the verification model, the user has a hypothesis and either confirms or refutes it. In exploratory analysis, also known as the discovery model and pioneered by John Tukey (Tukey, 1973), one finds suitable hypotheses to confirm or refute. Here the system, not the user, takes the initiative in data analysis. By our definition of the problem, data mining fits the mold of exploratory analysis: the system is expected to initiate the questions by itself.
From the process-oriented view, there are three classes of data mining activities: discovery, predictive modeling and forensic analysis, as shown in Figure 1. The figure is adapted from (Parsaye, 1997b).
Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about what the patterns may be (Parsaye, 1997b). This deals with the application of various types of approaches and algorithms to find these patterns or models in the data. These include logic methods (rules, decision trees, etc.) and equational methods (statistics, neural nets, etc.).
In Predictive Modeling, models discovered from the data-set are used to predict the future. Predictive Modeling thus allows the user to submit records with some unknown field values, and the system will guess the unknown values based on the previous models discovered from the data-set. A predictive model could be a classifier like a decision tree or linear regression.
Forensic Analysis is the process of applying the extracted models to find anomalous or unusual data elements. The focus of this thesis is on predictive modeling.
1.2 Current Challenges in Predictive Modeling
One means of acquiring new knowledge/models from databases is to apply various machine learning algorithms that compute descriptive representations of the data as well as patterns that may be exhibited in the data. Chen, et al. (1996) have presented an excellent overview of the range of machine learning algorithms and their applicability. There is also excellent research literature on the scalability of algorithms over large data-sets (Parsaye, 1996; Provost, 1997; Chan and Stolfo, 1995). However, new challenges arise when the data is distributed and when there is a need for the inter-operability of systems.
Two situations call for distributed data mining, i.e., building predictive models over local data-sets at remote sites and then combining the results at a central site using model selection or averaging.
Sometimes, data is inherently distributed. An example is the Health Care database of UHC (University Health Consortium), which is a consortium of 60 university hospitals. It is possible to transport the data to one central site and build one predictive model over this data-set. This model may provide the best possible accuracy, limited only by the algorithm used. But when data-sets are very large, the cost of transporting data to a single site is usually high and, at times, unacceptable.
At other times, data is artificially distributed. This can be because many existing learning algorithms require all the data to be resident in memory, which is clearly untenable for many real-world data-sets. Provost (1997) has suggested data partitioning as one solution for scaling up such algorithms. Also, Domingo (1997) and Drucker (1997) have shown that the error rate of classifiers like trees can be reduced by learning multiple models and combining the results. This multiple-model approach is referred to as ensemble learning, partitioned learning, distributed learning or meta-learning, depending on the context (Grossman, 1998a).
Hence, the underlying approach in these distributed data mining environments is to build predictive models locally and then transport the models to a central site. This approach is usually expected to be more efficient in terms of run-time and network traffic costs. Since models are built in parallel across remote sites, run-time costs are reduced. Also, since serial code is executing at each site, it is not necessary to write parallel code that needs to coordinate processes across the network, though parallelism at local sites can still be exploited. Network costs are less because the models are expected to be smaller than the data-sets. However, the challenge of combining the models using selection or averaging for maximum prediction accuracy is a fairly serious one.
The challenges here are: 1) how to inter-operate the vast number of data mining systems in use today, and 2) how to perform benchmarking tests.
"The proprietary nature of the available data mining system is inhibiting the growth of the data mining industry as a whole." (Grossman, 1997b)
The above observation indicates the need for collaborative modeling in data mining. Realistic data mining systems are becoming much too complex for any single group of researchers to implement single-handedly, requiring collaboration among people from different specializations. There is a need for the exchange and incorporation of predictive models between data mining systems, possibly running on different hardware/software platforms.
Communicating the structure of the model to others can become an insurmountable obstacle to collaboration and acceptance of the predictive model.
These two requirements necessitate a distributed data mining framework that provides a mechanism for managing and transporting predictive models between geographically remote sites. There is also a need for a universally accepted common description format for these predictive models that allows for the easy exchange and incorporation of models between systems, perhaps from different vendors.
This description format can also form the basis for benchmarking tools and standards, and may be important for the commercial acceptance of data mining products. According to David Clark's "apocalypse of the two elephants," shown in Figure 2 (Tanenbaum, 1996), it is essential that reasonable standards be formed before the onrush of commercial investment.
1.3 A historical and philosophical perspective
To understand the current challenges better, it is both interesting and useful to trace the evolution of data mining systems.
Computers can now store all forms of information: records, documents, images, sound recordings, videos, scientific data, and many other data formats. Computer scientists have made great strides in capturing, storing, managing and visualizing this data. These tasks are generally called data management.
There have been several distinct phases in data management (Gray R, 1996). Initially, data was processed manually, with the first known records dating back to 4000 BC in Sumeria for keeping track of royal assets and taxes. The first practical programmatic information processing began in the early twentieth century with the introduction of punched cards and electro-mechanical computers to sort and tabulate millions of records. The third phase stored data on magnetic tape and used stored-program computers to perform batch processing on sequential files. The fourth phase automated the concept of a database schema and online navigational access to the data (Gray R, 1996). The fifth phase automated access to relational databases and added distributed and client-server processing. We are now in the sixth generation of systems, which not only store much richer kinds of data (such as images, voice and video) but also provide specialized features like data warehouses and data marts to enable fast access to relevant data.
A Data Warehouse (Grossman, 1997a and Mon, 1997) is a data management system structured to facilitate the analysis of data; it is where all the corporate data is stored in one place using normalized data structures. A Data Mart (Mon, 1997) is where departmental data is stored, often with external data items added. Both deal mostly with the data space.
While advances in data management technologies have allowed us to store huge amounts of data, a parallel problem is that of analyzing these large data-sets. Moreover, data is at times inherently distributed and cannot be localized on any one machine. Distribution adds one more dimension to the data analysis problem.
The Earth Observation System / Data Information System (EOS/DIS) (Gray R, 1996) being built by NASA is expected to accumulate 5 terabytes of remote sensor data each day. There are other scientific, government and commercial databases that are growing at a rapid pace. An example of a distributed database is the TIMSS project student database (TIMSS, 1996).
Current database search techniques, like SQL reporting or spreadsheets, do not scale well to these large and distributed data sets. Users expect not only automated management of these data-sets but also "simple and powerful automated tools" to browse, search, analyze and visualize the data.
"These requirements are stretching the limits of what we know how to do today." (Parsaye, 1996)
The above problem is not just a database problem. It has attracted people from backgrounds as diverse as machine learning, statistics, computational science, control theory and many more, to apply their skills and techniques toward the unified goal of developing automated tools and techniques for analyzing huge data-sets. After the expected debates on techniques from the various fields, there is a convergence of ideas on the nature of the problem and the applicability of solutions. A new field known as Data Mining has emerged from these debates.
1.4 Scope of this work
This thesis work is aimed at proposing a framework for predictive modeling in distributed and collaborative environments. The focus is to attack the problem from the application design viewpoint.
The following forces motivate this work. We believe that research into application-level design has been lacking in the field of data mining. A great deal of attention within the data mining community has focused on the discovery part, i.e., the learning algorithms and related issues. While this is very important and serious work, it is important that efforts be continued with equal vigor at the application design end of the data mining spectrum.
Meta-Model and Meta-Cluster approaches are proposed to handle mining of distributed data within the same system, or to allow for inter-operability of different systems.
The meta-model approach deals with predictive models. The emphasis is not on the process of model formation but on the description of predictive models. This approach uses a declarative formalism (based on SGML/XML) for describing predictive models. This declarative Model Interchange Format (MIF) is described using a language known as the Predictive Model Markup Language (PMML) (Grossman et al., 1998 and PMML, 1998).
A meta-cluster approach is being proposed as a possible solution to the problem of distributed data mining architectures over the wide area. This seeks to compute a number of independent classifiers by applying learning programs to a collection of independent and geographically distributed databases in parallel. The results of the independent classifiers (i.e., derived models) are then combined using the PMML.
A distributed resource management infrastructure that uses Tcl based software agents (Gray et al., 1996) and SGML based meta-data representation is presented. The core component of the infrastructure, i.e., the Resource Information Format (RIF), is discussed in detail.
These approaches are demonstrated using a prototype implementation, Papyrus. The performance results of Papyrus on distributed data for multiple-model learning are detailed. We have used healthcare data obtained from the University Health Consortium (UHC) and the C4.5 learning algorithm (Quinlan, 1993) for the experiments.
The rest of the work is presented as follows:
Chapter 2 discusses the idea of meta-models based on Markup Languages like SGML/XML and also describes the prototype MIFs developed using PMML.
Chapter 3 describes the proposed resource information management infrastructure for distributed data mining architectures and details its core element, RIF.
Chapter 4 presents the architecture and design of the prototype system, Papyrus, along with the implementation details.
Chapter 5 details the experimental setup and the results of performance tests on the UHC patient database using the Papyrus architecture.
The report concludes with a discussion of our experiences with the current work and directions for future work in Chapter 6.
2. PMML: A Meta-Model Formalism for Predictive Models
In the previous chapter, a case was made for the necessity of exchanging predictive models in distributed and collaborative environments. In this chapter, a meta-model approach for the description and, hence, exchange of models is proposed. The design guidelines and specifications of a model interchange format based on this meta-model are presented. This MIF uses PMML as the description language. In the next chapter, other controls needed for exchange of models in a distributed framework are discussed.
2.1 What is a Meta-Model Approach?
Our experience with data mining systems has indicated that users want semantically expressive models for their particular application domain, e.g., decision trees for medical data or Naïve Bayes algorithms for text mining. On the other hand, vendors need to provide general purpose tools. Hence from the application design viewpoint there is a need to concentrate on describing meta-models from which different predictive models can be built.
If predictive models are representations of some learning concept, then meta-models are descriptions of predictive models. A meta-model uses a language to define all well-formed models that may be represented. Similarly, in order to flexibly describe a meta-model, a meta-metamodel is required. A meta-metamodel defines an abstract language for specifying meta-models; a meta-model is an instance of a meta-metamodel and is itself a language for specifying models. This approach was first detailed by David Pullar and Kristin Stock (Pullar and Stock, 1997) for geospatial modeling. The meta-model in our case is the Model Interchange Format (MIF), and the language it uses to define well-formed predictive models is the Predictive Model Markup Language (PMML). The meta-metamodel uses the abstract languages SGML/XML for specifying instances of meta-models (e.g., PMML, HTML, etc.).
The meaning of these abstract concepts is depicted in figure 3 using an analogy to the CORBA (CORBA, 1995) and HTML systems. This figure is adapted from Crawley, et al. (1997).
If a meta-metamodel describes a type system for CORBA IDL, then the meta-model would be certain schemas of CORBA IDL types, and the entities in the model would be CORBA objects. Similarly, a meta-metamodel would be an SGML/XML system for both text documents and predictive models. For the text documents, the meta-model, i.e., Text Interchange Format (TIF) uses HTML as the defining language. On the other hand, our system for predictive models uses PMML as the language for describing the Model Interchange Format meta-model.
At an abstract level any meta-model consists of two parts: 1) a model specification and 2) a model executable. The model specification is the description of the model that is used to generate the model instance or executable. Even though we are using an SGML/XML based declarative formalism for predictive model specification, in general three approaches are possible. Section 2.2 details these approaches and section 2.3 discusses the related modeling efforts. In section 2.4, the design goals of our approach are explained.
This thesis deals with the specification requirements of the predictive models, leaving the model executable duties or choices to the various people exchanging the models.
2.2 Approaches to Model Specification
There have been few efforts to date in the data mining community on modeling predictive models.
In the summer of 1996, the program co-chairs of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96) in Portland arranged a general meeting on the Bayesian Network Interchange Format through the efforts of the Decision Theory Group at Microsoft Research and a team at the University of Aalborg in Denmark. The format that came out of these efforts uses a BN mini-language (BNIF, 1997) that has borrowed many features from the C/C++ language and includes a parser written in YACC. This formalism is a kind of procedural formalism.
2.3 Related Modeling efforts
There is a lot to be learned from modeling efforts in some other areas as well. Scientists at the Marshall Space Flight Center (Jaap and Davis, 1994) at NASA have developed a model interchange format for scheduling models among the community of space activity schedulers. This format is declarative in nature but proprietary to NASA. Also, PowerSim Corporation has developed declarative model interchange formats for their System Dynamics simulator (PowerSim, 1997).
2.4 Design Goals for Model Interchange Format
Four significant requirements drove the formulation of the model interchange format for predictive models. These were that any MIF must be universal, portable, human readable and extensible. Since the format was developed chiefly for the interchange of models, rather than for the storage and manipulation of models within data mining engines, these requirements were deemed to be more important than efficiency.
2.5 SGML/XML based formalism for Model Interchange Format
In view of the above design goals, a declarative formalism based on SGML/XML was chosen for predictive model specifications. The Predictive Model Markup Language (PMML) is used to define the structure and intent of the predictive model. PMML is a simple markup language that uses SGML/XML as its meta-language, in a manner similar to the Hypertext Markup Language (HTML) used for World Wide Web documents. SGML is International Standard ISO 8879 -- Standard Generalized Markup Language (SGML, 1986), and XML is the Extensible Markup Language (W3 XML, 1997).
"The predictive models described using PMML consist of several parts: 1) a header, 2) a data schema, 3) a data mining schema, 4) a predictive model schema, 5) definitions for predictive models, 6) definitions for ensembles of models, 7) rules for selecting and combining models and ensembles of models and 8) rules for exception handling. Component 5) is required. In addition a schema for the predictive model must be defined. This can be done using one or more of the schemas – components 3, 4 and 5. The other components are optional."(Grossman, 1998a).
PMML 0.8 uses SGML as its meta-language and a C++ based parser library (SP) to provide facilities or routines for generating and validating MIF documents. PMML 0.9 uses XML and a Java based parser library.
As an SGML application, the syntax of conforming PMML 0.8 documents is defined by the combination of the SGML declaration and the document type definition (DTD). This specification defines the intended interpretation of PMML 0.8 elements, and places further constraints on the permitted syntax which are otherwise inexpressible in the DTD.
A PMML document is stored in a file with a .pmml extension. Also, the document type definition for PMML is specified using SGML/XML and stored in a file with a .dtd extension.
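As an illustration of how a MIF consumer might ingest such a document, the sketch below walks the elements of an XML based PMML 0.9 file and prints them with their attributes. It is only a sketch: the thesis implementation uses the SP library for SGML and a Java parser library for XML, whereas this example assumes the freely available expat C library and a hypothetical file name.

// Illustrative sketch only, not the parser library used by the thesis.
// Reads a hypothetical file "model.pmml" (PMML 0.9, XML) and prints the
// element tree, which is the first step a scoring engine would take.
#include <expat.h>
#include <cstdio>

static void startElement(void *userData, const char *name, const char **atts)
{
    int *depth = static_cast<int *>(userData);
    for (int i = 0; i < *depth; ++i)
        std::printf("  ");
    std::printf("<%s>", name);
    // Attributes arrive as a NULL-terminated list of name/value pairs.
    for (int i = 0; atts[i]; i += 2)
        std::printf(" %s='%s'", atts[i], atts[i + 1]);
    std::printf("\n");
    ++*depth;
}

static void endElement(void *userData, const char *name)
{
    (void)name;
    int *depth = static_cast<int *>(userData);
    --*depth;
}

int main()
{
    std::FILE *fp = std::fopen("model.pmml", "r");  // hypothetical MIF file
    if (!fp)
        return 1;

    int depth = 0;
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetUserData(parser, &depth);
    XML_SetElementHandler(parser, startElement, endElement);

    char buf[8192];
    for (;;) {
        size_t len = std::fread(buf, 1, sizeof(buf), fp);
        int done = std::feof(fp) ? 1 : 0;
        if (XML_Parse(parser, buf, static_cast<int>(len), done) == XML_STATUS_ERROR) {
            std::fprintf(stderr, "parse error at line %lu\n",
                         static_cast<unsigned long>(XML_GetCurrentLineNumber(parser)));
            break;
        }
        if (done)
            break;
    }
    XML_ParserFree(parser);
    std::fclose(fp);
    return 0;
}

Validation against the PMML DTD is left to a validating parser; the point here is simply that any generic SGML/XML toolkit can read a MIF document without knowledge of the producing data mining engine.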
2.6 Advantages of using SGML/XML based formalism
The advantages of using an SGML/XML based formalism are many.
A related consideration is the trend in practical applications: in fields like healthcare, SGML based formats are fast becoming the standard.
This text based approach may not be as efficient as other methods, but as discussed in section 2.4, in our view that is not a primary requirement here.
The current implementation of PMML 0.8 supports predictive models like decision trees (C4.5, ID3 and CART (Breiman et al., 1984)) and linear regression. The modular design allows for the later addition of specifications for other models once the experts in those fields agree on a format.
2.7 File format for Single Predictive Models
The structure of a PMML document for a single predictive model is described in Appendix A. The format provides mechanisms for describing the data schema as well as the predictive models (like C4.5, CART, ID3 and linear regression).
2.8 File format for Ensemble of models
"Ensembles of models are important for several reasons. 1) Ensembles of models often yields more accurate predictors (Brieman, 1996 and Wolpert, 1992). 2) Large data sets may be partitioned, the different partitions may be mined in parallel to produce an ensemble of models, and then the results combined using model averaging (Grossman, 1996). 3) Distributed data may initially be mined separately to produce an ensemble of models and then a single predictive model produced from this ensemble. Sometimes this is called meta-learning (Chan and Stolfo, 1995)." (Grossman, 1998a). The structure of PMML documents for ensembles of models is described in Appendix B. The key addition for ensembles of models is the introduction of the model-weight element.
3. Resource Information Management in heterogeneous, distributed data mining environments
In the previous chapter, a formalism for the description of predictive models was detailed. Many more control mechanisms are needed for this formalism to work in heterogeneous and distributed data mining environments. This chapter presents an experimental framework for resource management in these environments and details its core structure, the RIF (Resource Information Format) model for unified metadata representation and management. This resource management infrastructure is designed around the needs of an underlying architecture whose characteristics are presented first.
3.1 Characteristics of the Underlying Architecture
The following features of the architecture need to be considered in designing the resource management framework.
To accommodate a large number of cooperating sites, the sites are hierarchically grouped into clusters, with each cluster having a well-defined point of entry. Since a Wide Area Network might link the clusters, each cluster might have different hardware and storage capabilities.
Each cluster must have complete control over its own resources. That includes what data and knowledge resources to keep and where to keep them. Each site can choose its data representation (e.g., relational, object-oriented, etc.) and appropriate data discovery tools (decision trees, neural networks, etc.).
Global queries are dynamically formed and allocated to clusters based on the latest resource information available. Thus, the resource information gathering is event-triggered.
Since moving large amounts of data around is a heavy-weight operation, the query is moved to the data. Also, the data allocation is static or infrequently changing for such architectures.
3.2 Unified Global Resource Management Model
A global information resources framework is developed to meet the requirements of the above architecture. It is global because the names in this framework are global. Two basic elements are necessary for the resource management environment:
The logical model for meta-data representation is a core element of this infrastructure and is physically embodied in the proposed Resource Information Format described below.
The management system includes the cluster-point-of-entry infrastructure for cluster management and the Tcl software agent system (Ashok, 1998) to communicate between the clusters.
The emphasis in this chapter is on meta-data design and format. Refer to chapter 4 for details on the management system.
3.3 Role of Meta-data in Emerging Systems
Meta-data is a key element in any resource management system. Representative (previous) meta-data systems include traditional data dictionary systems, the Information Resources Dictionary System (IRDS) of the National Institute of Standards and Technology (ANSI X3H4, 1985), the metadatabase system (Hsu, et al., 1991) and, more recently, the Meta Data Council's Interchange Specification Initiative (MDC, 1997). Most of these are limited to modeling only data resources. The scope of the meta-data needs to be extended for emerging data mining systems because of the following two considerations.
In data mining systems, the meta-data must represent both data systems and knowledge resources. Thus, the contents of the meta-data include not only information about the data resources (like the data schemas, etc.) but also information like the knowledge models (e.g., predictive models), data discovery tools and data extraction tools.
To realize distributed and collaborative data mining environments, the meta-data should model not just the subsystems, i.e., individual sites, but also the information that formulates the interactions between sites that may be heterogeneous and located over a Wide Area Network (WAN). WANs introduce heterogeneity, like disparate data views, that may not exist for LANs.
Thus, the scope of meta-data encompasses both data resources and knowledge, and requires new methods and techniques to manage these information resources in a unified way for diverse user groups. The logical components of meta-data in distributed data mining systems are discussed next.
3.4 Logical Components of Metadata
Several requirements for an ideal meta-data representation method can be derived from the above discussion.
While the data views and decision knowledge are essential for intra-cluster autonomy and functioning, operating knowledge is the glue that acts as an active facilitator for inter-cluster information integration and allows users to tap resources over wide area clusters for global query formation and processing.
We have chosen a relation-like approach to accommodate heterogeneous data views. Thus a dataset element is viewed as consisting of several table elements and associated predictive model elements. A table element is described by a set of attribute-descriptor elements that contain information, like use-as and data-type, required by data discovery algorithms. A model element has information like the model name and model type.
Dataset elements are encompassed by domain elements that help to generalize the framework over different domains. The domain elements, in turn, are encompassed by cluster elements. Cluster elements represent the cluster-point-of-entry and have information on the data discovery tools available at the cluster in the form of data mining engine elements.
These meta-data elements in our framework are described by what we call the Resource Information Format (RIF). The design guidelines and specifications of this format are presented next.
3.5 Design Goals for Resource Information Format
The design goals for the resource interchange format are similar to the design goals for MIF, i.e., universality, portability, extensibility and human-readability.
Using reasoning similar to MIF, an SGML based declarative formalism was chosen to describe RIF for easy integration into the system. The language used to describe the format is also called RIF in the present implementation, though ideally it should be called RIML (Resource Interchange Markup Language). The specifications for the RIF 0.8 implementation are described in Appendix C.
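As an illustration, a minimal RIF 0.8 instance following the element hierarchy specified in Appendix C might look like the fragment below. The cluster, domain, dataset and table names, as well as the point-of-entry address, are hypothetical, and the SCHEMA and MODEL contents are elided.

<!DOCTYPE RIF PUBLIC "-//LAC//DTD RIF 0.8">
<RIF>
<CLUSTER NAME='lac-cluster' POINT-OF-ENTRY='cpoe.lac.example.edu'>
<DOMAIN NAME='Medical'>
<DATASET NAME='uhc-patient-data'>
<TABLE NAME='admissions'>
... SCHEMA element describing the table
</TABLE>
... MODEL elements for predictive models built on this dataset
</DATASET>
</DOMAIN>
</CLUSTER>
</RIF>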
4. Papyrus: Philosophy, Design and Architecture
In this chapter, an overview of the meta-cluster approach is presented, and Papyrus, a system that employs PMML for inter-operability, RIF for resource information management and the meta-cluster approach to address large scale distributed data mining applications, is described. Papyrus is a powerful and easily extensible software agent based system that computes classifiers over distributed data. Papyrus is being used in experiments dealing with real-world learning tasks such as predicting patient outcomes in medical data. We are also using Papyrus as a work-bench for experiments on multiple-model learning.
4.1 Meta-Cluster Approach
The meta-cluster technique provides a unifying and scalable solution that improves the efficiency and accuracy of inductive learning when applied to large amounts of data in wide area computing networks, for a range of different applications.
In this approach, data is distributed over a number of sites, with all the sites on a particular LAN being identified as a cluster. The clusters are connected to each other via a WAN. A number of learning processes (each implemented as a distinct serial program) are executed in parallel on these data sets or data subsets. The collective results are then combined through the meta-cluster. Stolfo, et al. (1997) and Kargupta, et al. (1997) have also investigated the meta-cluster approach.
4.2 Papyrus Architecture
Papyrus is architected as a distributed computing construct. Papyrus uses PMML for the predictive model description, RIF for the resource information description, Tcl agents (Gray J, 1996) for communications between clusters, and Ptool (Grossman, et al., 1994) as the data warehouse.
Papyrus is implemented as a collection of distributed learning and classification programs linked together through a network of clusters. Each cluster consists of the components described below.
First, the user accesses the Data Mining Point Of Presence (DMPOP) through a web browser. After proper authentication the DMPOP sends the agents out to all clusters accessible via CPOEs to get the information about the resources available. These resources can be data-sets, existing predictive models or various data mining services. Based on this information (represented using RIF) the user then interactively builds a data mining query. The query could be to build a model on some particular data-set, score using some existing models, or visualize an existing model. The agents then dispatch this query to the relevant clusters and initiate their execution. The data mining engine at each cluster executes the query on the local data sets and the agents then send the results (represented using MIF) back to the DMPOP, which then displays them to the user. The logical architecture of the system is presented in Figure 4.
Figure 4: The logical architecture of Papyrus system.
4.3 Major Components of Papyrus
Agent-Tcl (Gray J, 1996) was used as the transportable agent system, because Tcl (Ousterhout, 1994) is an embeddable scripting language that is highly portable, highly popular and freely available.
This implementation has various library routines and utilities to clean and transform the data sets appropriately for the various predictive model building algorithms.
4.4 Advantages of Papyrus Architecture
Papyrus was included in the demonstrations performed by the Laboratory for Advanced Computing (LAC) at the Internet 2 conference at Washington DC (Oct. 7-9, 1997) and at Supercomputing 97 conference in San Jose (Nov. 16-21, 1997). Interested readers can learn more about this project at http://www.lac.uic.edu/~balinder/medical/sc97/top.html.
5. Performance Analysis of the Multiple-Model Learning Approach for Inherently Distributed Data-Sets using Papyrus
This chapter describes our initial experiments using Papyrus for multiple-model learning on inherently distributed data over wide area networks. The time complexity and error rates of building C4.5 decision tree classifiers concurrently at multiple sites, transporting them to a central site and combining the scores using majority voting are tabulated. The results show that the time complexity of multiple-model learning grows more nearly linearly with the number of training examples than that of the single model approach. Also, in certain cases, multiple-model learning outperforms the single model approach in terms of prediction errors.
5.1 Inherently distributed data
The increase in the amount of readily available information has furthered the development of distributed information storage. In certain cases, data is inherently distributed and cannot be localized on any one machine for a variety of practical reasons. In other cases data is artificially distributed due to the limitation of certain algorithms to work on larger data sets.
There are basically two approaches to building a global classifier on such distributed data. The first is to move all the data partitions to one central site and then build a single predictive model on the entire data set. The second is to execute a number of learning processes on these data partitions and then bring the models back to a central site to integrate the results using model selection or model averaging. The former results in heavy bandwidth usage, while the latter raises issues regarding the accuracy of the classifier. Current approaches to multiple-model learning, like Bagging and Boosting, focus on minimizing the error percentage of decision tree classifiers. The data partitions used are not independent of each other, e.g., each data partition in bagging contains on average 63% of the original data set. Hence the focus so far has been mostly on error rate trends on artificially distributed data.
Our intention is to deal with inherently distributed data. Learning from an ensemble of models built on inherently distributed data raises several speed, accuracy and communication issues. The data partitions in such cases are disjoint, and hence the accuracy results of techniques like Bagging and Boosting cannot be directly related. Also, the time complexity of this multiple-model approach is expected to be significantly less than that of the single-model approach if the models are built concurrently. The nature of the networks involved, i.e., WANs, raises new communication issues. The network connection between geographically remote sites may or may not perform well compared to a local network of data sites.
Here we present our experiences with the multiple-model learning approach using Papyrus. We have experimented with the C4.5 (Quinlan, 1993) decision tree classifier. The data-set we have used is the healthcare data obtained from the University Health Consortium (UHC), which is naturally distributed on zip-code information. Data sites at the Laboratory for Advanced Computing (LAC) at UIC, at the University of Maryland and at the University of Pennsylvania, connected via the very high speed Backbone Network Service (vBNS), were used.
Our intention is to focus on the speed and accuracy issues of such a system. The primary goal of the experiments is to present empirical results comparing our approach with the alternative of moving the data to the query. We also present the time complexity trends of building ensembles of models over the LAN and the WAN. The statistical analysis of the error rates and the communication issues will be the focus of future work.
5.2 Experiments and Results
This section describes the experimental setup used for the speed and accuracy tests.
5.2.1 Goals
The experiments conducted focused on the speed and accuracy aspects of distributed learning. We intended to provide empirical results comparing the approach of moving the data to the query vs. the approach of moving the models, in the case of large, inherently distributed data-sets. In particular we had the following goals:
Speed
1) Analyze the time complexity of building ensembles of models over a range of ensemble sizes.
2) Compare the time complexity of building ensembles of models over the local area network to that over the wide area networks.
Accuracy
Present the prediction error rates for the ensembles of models and compare them with the single model approach rates.
5.2.2 Setup
The experiments were conducted for training datasets ranging from 1,000 examples (0.26 MB) to 200,000 examples (50.10 MB), distributed across 1 to 21 data sites, both over the local network and over the wide area network (vBNS). The experiments were performed both by moving the query to the data and by moving the data to the query.
5.2.3 Time and Accuracy Results
Figure 5 plots the prediction error percentage vs. the number of examples in the training dataset, for numbers of models ranging from 1, 3, ... to 21. The corresponding data is in Table I.
Figure 6 plots the time complexity (in seconds) vs. the number of examples in the training dataset, for numbers of models ranging from 1, 3, 5, ... to 21, for the local area tests. The corresponding data is in Table II.
Figure 7 plots the time complexity (in seconds) vs. the number of examples in the training dataset, for numbers of models ranging from 3, 5, ... to 21, for the wide area tests. The corresponding data is in Table III.
Figure 8 plots the time complexity (in seconds) vs. the number of examples in the training dataset, for the data transfer and model transfer approaches over 5 data sites. The corresponding data is in Table IV.
Table I: Prediction Error Trends for Models built on LAN and WAN
#Records | 1-Model | 3-Models | 5-Models | 7-Models | 9-Models | 15-Models | 21-Models
1000 | 14.4 | 14 | 13.6 | 13.6 | 15.2 | 14.8 | 14
5000 | 14.9 | 14.96 | 15.36 | 15.36 | 15.12 | 15.6 | 15.2
10000 | 12.7 | 13.44 | 12.92 | 14.32 | 13.92 | 14.24 | 14.16
50000 | 13.8 | 11.984 | 13 | 12.912 | 13.072 | 13.072 | 13.126
100000 | 12.8 | 12.812 | 12.912 | 12.844 | 12.852 | 12.972 | 13.128
200000 | - | 12.37 | 12.16 | 12.284 | 12.226 | 12.228 | 12.198
Columns 2 through 8 show the percentage error in prediction.
Figure 5: Prediction Error Trends for Models built on LAN and WAN
Table II: Time Complexity Trends Using Local Cluster at UIC
#Records | 1-Model | 3-Models | 5-Models | 7-Models | Bench Mark
1000 | 5.05 | 9.7 | 14.56 | 20.64 | 2.26
5000 | 10.31 | 12.12 | 17.6 | 23.98 | 9.25
10000 | 18.46 | 15.5 | 20.91 | 27.98 | 18.62
50000 | 154.89 | 60.64 | 86.78 | 71.47 | 156.81
100000 | 437.47 | 131.04 | 114.03 | 137.48 | 424.14
200000 | - | - | - | - | -
Columns 2 through 6 show the time taken (in seconds) to build the models and combine the results.
Figure 6: Time Complexity Trends Using Local Cluster at UIC
Table III: Time Complexity Trends Using Clusters at UIC, UPENN and UMD
#Records | 3 Models | 5 Models | 7 Models | 9 Models | 15 Models | 21 Models
1000 | 10.56 | 17.4 | 22 | 28.27 | 47.47 | 62.44
5000 | 14.06 | 21.1 | 26.4 | 33.01 | 53.18 | 77.94
10000 | 19.6 | 27.28 | 32.85 | 40.01 | 63.41 | 88.17
50000 | 82.49 | 79.56 | 86.92 | 155.49 | 156.86 | 200.89
100000 | 185.01 | 161.24 | 139.54 | 183.89 | 255.41 | 362.65
200000 | 410.11 | 330.52 | 335.14 | 294.42 | 412.61 | 675.6
Columns 2 through 7 show the time taken (in seconds) to build the models and combine the results.
Figure 7: Time Complexity Trends Using Clusters at UIC, UPENN and UMD
Table IV: Time Complexity - Data Transfer vs. Model Transfer (over 5 locations)
# Records | LAN DT Time | WAN DT Time | C45 Execution Time | LAN DT & C45 | WAN DT & C45 | LAN MT & C45 | WAN MT & C45
5000 | 0.6 | 1.661 | 9.25 | 9.85 | 10.911 | 17.6 | 21.1
10000 | 0.96 | 2.885 | 18.62 | 19.58 | 21.505 | 20.91 | 27.28
50000 | 3 | 11.96 | 156.81 | 159.81 | 168.77 | 86.78 | 79.56
100000 | 4.3 | 21.75 | 424.14 | 428.44 | 445.89 | 114.03 | 161.24
DT : Data Transfer
MT : Model Transfer
Columns 2 through 8 indicate the time in seconds.
Figure 8: Time Complexity - Data Transfer vs. Model Transfer (over 5 locations)
5.3 Initial Observations
Using our infrastructure (128 MB RAM), C4.5 could not handle 150,000 examples with the single model approach, but we could run on more than 200,000 examples using the multiple model approach.
5.3.1 Time Complexity
Empirical results show that, beyond a break-even point, the time complexity of building ensembles of decision tree classifiers is less than that of the single model approach. This break-even point is a function of the number of models in the ensemble. For example, in Table II a single model on 1,000 records takes 5.05 seconds versus 9.7 seconds for 3 models, while on 100,000 records the single model takes 437.47 seconds versus 131.04 seconds for 3 models. Also, the ensemble times grow more nearly linearly with the number of training examples.
5.3.2 Prediction Error Rates
The error rates of the ensembles are lower for small numbers of training examples but fare more poorly for large numbers of training examples. However, there seems to be an ensemble size (in the 5 to 7 model range) which consistently performs better than a single model.
5.4 Initial Conclusions
An early conclusion is that there is a trade-off involved in using multiple-model learning for inherently distributed, and hence disjoint, datasets. If time complexity is a major concern in predictive modeling, then multiple-model learning is decidedly better than a single model approach. But, for prediction error rates, multiple-model learning is good only for a range of ensemble sizes and not for every ensemble.
6. Conclusions and Future Work
This thesis report introduced the architecture of Papyrus, a distributed, scalable and extensible system that supports the launching of learning agents to distributed sites over wide area networks.
The application design concepts embodied by the framework of Papyrus, i.e., the meta-model approach (the PMML based Model Interchange Format), the Resource Information Format for resource management, and the meta-cluster approach, provide an important step in developing systems that learn from massive data-sets in distributed and collaborative data mining environments.
The performance results of Papyrus on inherently distributed data indicate that it is a suitable workbench for experiments on multiple-model learning. The speed and accuracy trends of these experiments were tabulated.
While the current implementation of Papyrus uses SGML based MIFs (described by PMML 0.8 and C++ libraries), work is in progress on the next version, which uses XML (described by PMML 0.9 and Java libraries). Specifications for more learning algorithms are also planned for incorporation in the next release. The software agent system currently used is a Tcl system; Java agents, if developed, would provide more portability.
The performance results provide scope for future research, e.g., statistical analysis of the prediction error trends. The communication issues involved in multiple-model learning (like network bandwidth usage) will also be investigated in the future.
APPENDIX A: File Format for Single Predictive Models
The Structure of PMML documents
PMML 0.8 Documents start with a <!DOCTYPE> declaration followed by a MIF element containing a HEADER and then a MODEL element:
<!DOCTYPE PMML PUBLIC "-//LAC//DTD PMML 0.8 ">
<MIF>
<HEADER>
<DATA-SCHEMA>
... other head elements
</HEADER>
<MODEL>
...elements specific to a model
</MODEL>
</MIF>
In practice, the MIF, HEADER and MODEL end tags can be omitted from the markup as these can be inferred in all cases by parsers conforming to the PMML 0.8 DTD.
Every conforming PMML 0.8 document must start with the <!DOCTYPE> declaration that is needed to distinguish PMML 0.8 documents from other versions of PMML. Every PMML 0.8 document must include DATA-SCHEMA in the HEADER block and a model specific part like CART-MODEL, ID3-MODEL, etc. in the MODEL block. A minimal PMML 0.8 document thus looks like:
<!DOCTYPE PMML PUBLIC "-//LAC//DTD PMML 0.8 ">
<MIF>
<HEADER>
<DATA-SCHEMA>
... other head elements
<MODEL>
...elements specific to a model
The HEADER element
<!ELEMENT HEADER - O (DATA-SCHEMA & CREATION-INFORMATION?) >
This contains the document header; the end tag for HEADER can always be omitted. The contents of the document header are an unordered collection of the following elements:
<!ELEMENT DATA-SCHEMA - -
(ATTRIBUTE-DESCRIPTOR, ATTRIBUTE-DESCRIPTOR+ ) >
Every PMML 0.8 document must have exactly one DATA-SCHEMA element in the document's HEADER. It provides the data-schema modeled by the given MIF file. It must contain at least two attribute descriptors, one being the predicted attribute.
<!ELEMENT ATTRIBUTE-DESCRIPTOR - O ( mapping-function? ) >
<!ATTLIST ATTRIBUTE-DESCRIPTOR
NAME CDATA #REQUIRED
USE-AS (exclude | continuous | category | binary-category) #REQUIRED
DATA-TYPE ( real | integer | boolean | string ) #REQUIRED
>
It describes a single attribute of the data-schema. It can contain at most one mapping function. NAME specifies the name of the attribute, USE-AS specifies the usage of this attribute in the data mining process, and DATA-TYPE specifies the way the attribute is stored in the database.
<!ELEMENT MAPPING-FUNCTION - - CDATA >
<!ATTLIST MAPPING-FUNCTION TYPE CDATA #REQUIRED>
The mapping function describes the transformation to be performed on the attribute. The TYPE attribute indicates the language in which the mapping function is written.
<!ELEMENT CREATION-INFORMATION - -
( COPYRIGHT? & APPLICATION? & INDIVIDUAL? & TIMESTAMP? ) >
This is the information about how, when and by whom the model was created. It is optional and the tags in this sub tree are self-explanatory.
The MODEL element
<!ELEMENT MODEL - O
( CREATION-INFORMATION?, (CART-MODEL | REGRESSION-MODEL | ID3-MODEL ) ) >
<!ATTLIST MODEL
NAME CDATA #REQUIRED
TYPE (CART | C4.5 | OC-1) #REQUIRED
TRAINING-DATA-NAME CDATA #IMPLIED
TRAINING-DATA-SIZE NUMBER #IMPLIED
>
This contains the model specific part of the document. The end tag for MODEL may be omitted. The key attributes are: model name and the model type.
This field is the same as that described for the HEADER block.
This block describes the details of a particular type of predictive model. Here we present the MIF elements for the models we support.
a) C4.5
<!ELEMENT C45-MODEL - - ( (C45-NODE | C45-LEAF-NODE)+ ) >
<!ATTLIST C45-MODEL
...
>
b) CART
<!ELEMENT CART-NODE - O EMPTY >
<!ATTLIST CART-NODE
... >
<!ELEMENT CART-LEAF-NODE - O EMPTY>
<!ATTLIST CART-LEAF-NODE
…
>
<!ELEMENT CART-MODEL - - ( (CART-NODE | CART-LEAF-NODE)+ ) >
<!ATTLIST CART-MODEL
TYPE ( binary-classification | classification | regression ) #REQUIRED
ATTRIBUTE-PREDICTED CDATA #REQUIRED
NUMBER-NODES NUMBER #REQUIRED
DEPTH NUMBER #REQUIRED
>
It marks the beginning of a cart-model. The attributes of this tag include the TYPE of the CART model, the attribute that is predicted using this model, the number of nodes in the tree and the depth of the tree.
<!ELEMENT CART-NODE - O EMPTY >
<!ATTLIST CART-NODE
NODE-NUMBER NUMBER #REQUIRED
ATTRIBUTE-NAME CDATA #REQUIRED
LEFT-CHILD NUMBER #REQUIRED
RIGHT-CHILD NUMBER #REQUIRED
CUT-VALUE CDATA #REQUIRED
>
This denotes a non-leaf node in the tree. Its attributes are the node number, the attribute name associated with the node, the node numbers of its left and right children and the cut value.
<!ELEMENT CART-LEAF-NODE - O EMPTY>
<!ATTLIST CART-LEAF-NODE
NODE-NUMBER NUMBER #REQUIRED
SCORE CDATA #REQUIRED
>
This denotes a leaf node in the tree. Its attributes are the node number and the class value associated with it.
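To make the intent of these elements concrete, the sketch below shows one way a consumer of a MIF document might score a record against the flattened CART node list after parsing. It is an illustration only: the convention that values less than or equal to the cut value go to the left child, the use of node number 0 as the root, and the representation of record values as numbers are all assumptions of this sketch, not part of the format.

// Illustrative sketch of scoring a record with a CART model read from a MIF
// document. 'nodes' is indexed by NODE-NUMBER; 'record' maps attribute names
// (as in ATTRIBUTE-NAME) to numeric values.
#include <map>
#include <string>
#include <vector>

struct CartNode {
    bool isLeaf = false;
    std::string attributeName;  // ATTRIBUTE-NAME (non-leaf only)
    double cutValue = 0.0;      // CUT-VALUE (non-leaf only)
    int leftChild = -1;         // LEFT-CHILD node number
    int rightChild = -1;        // RIGHT-CHILD node number
    std::string score;          // SCORE, i.e. the class value (leaf only)
};

std::string scoreRecord(const std::vector<CartNode> &nodes,
                        const std::map<std::string, double> &record)
{
    int current = 0;  // assumption: the root is node number 0
    while (!nodes[current].isLeaf) {
        double value = record.at(nodes[current].attributeName);
        current = (value <= nodes[current].cutValue)
                      ? nodes[current].leftChild
                      : nodes[current].rightChild;
    }
    return nodes[current].score;
}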
c) ID3
<!ELEMENT ID3-MODEL - - ( (ID3-NODE | ID3-LEAF-NODE)+ ) >
<!ATTLIST ID3-MODEL
ATTRIBUTE-PREDICTED CDATA #REQUIRED
NUMBER-NODES NUMBER #REQUIRED
DEPTH NUMBER #REQUIRED
>
It marks the beginning of an id3-model. The attributes of this tag include the attribute that is predicted using this model, the number of nodes in the tree and the depth of the tree.
<!ELEMENT ID3-NODE - O EMPTY >
<!ATTLIST ID3-NODE
NODE-NUMBER NUMBER #REQUIRED
ATTRIBUTE-NAME CDATA #REQUIRED
CUT-VALUE CDATA #REQUIRED
LEFT-CHILD NUMBER #REQUIRED
RIGHT-SIBLING NUMBER #REQUIRED
>
This denotes a non-leaf node in the tree. Its attributes are the node number, the attribute name associated with the node, the node numbers of its left child and right sibling and the cut value.
<!ELEMENT ID3-LEAF-NODE - O EMPTY>
<!ATTLIST ID3-LEAF-NODE
NODE-NUMBER NUMBER #REQUIRED
CUT-VALUE CDATA #REQUIRED
SCORE CDATA #REQUIRED
RIGHT-SIBLING NUMBER #REQUIRED
>
This denotes a leaf node in the tree. Its attributes are the node number, the cut value, the class value, and the node number of its right sibling.
d) LINEAR REGRESSION
<!ELEMENT LINEAR-REGRESSION-MODEL - O ( LINEAR-REGRESSION-COEFFICIENT )+ >
<!ATTLIST LINEAR-REGRESSION-MODEL
DIMENSION CDATA #REQUIRED
>
It marks the beginning of a linear regression model. The attribute of this tag is the dimension of the model.
<!ELEMENT LINEAR-REGRESSION-COEFFICIENT - - EMPTY>
<!ATTLIST LINEAR-REGRESSION-COEFFICIENT
COEFFICIENT-POSITION CDATA #REQUIRED
COEFFICIENT-VALUE CDATA #REQUIRED
>
This tag gives the position and the coefficient value for that position in the linear regression model.
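As an illustration, evaluating a transported linear regression model amounts to a dot product of the coefficient vector with the input vector. The sketch below assumes the coefficients have already been read out of the MIF document in COEFFICIENT-POSITION order; how an intercept term is represented is left to the model producer and is not specified here.

// Illustrative sketch: predict with a linear regression model whose
// coefficients were extracted from LINEAR-REGRESSION-COEFFICIENT elements.
#include <cstddef>
#include <vector>

double evaluateLinearRegression(const std::vector<double> &coefficients,
                                const std::vector<double> &input)
{
    double prediction = 0.0;
    for (std::size_t i = 0; i < coefficients.size() && i < input.size(); ++i)
        prediction += coefficients[i] * input[i];
    return prediction;
}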
An example PMML 0.8 document describing a single CART model follows:
<!DOCTYPE PMML PUBLIC "-//LAC//DTD PMML 0.8 ">
<mif>
<header>
<creation-information>
<copyright>
Copyright (C) 1997 Magnify Research. All Rights Reserved.
</copyright>
<application name='PATTERN:Detect' version='1.3alpha'>
<organization name='Magnify Research' url='www.magnify.com'>
<postal-address street1='100 South Wacker Drive' street2='Suite 1130' city='Chicago' state='Il' country='US' zipcode='60606-4003'>
</organization>
</application>
</creation-information>
<data-schema>
<attribute-descriptor name='card number' use-as='exclude' data-type='string'>
<attribute-descriptor name='timestamp' use-as='exclude' data-type='string'>
<attribute-descriptor name='dollar amount' use-as='continuous' data-type='real'>
<attribute-descriptor name='issuer' use-as='category' data-type='integer'>
<attribute-descriptor name='acquirer' use-as='category' data-type='integer'>
<attribute-descriptor name='sic code' use-as='category' data-type='integer'>
<attribute-descriptor name='fraud indicator' use-as='boolean' data-type='binary'>
<mapping-function type='perl'>
sub Char_Match_indicator
{
# the input will always be a scalar found in @_[0]
$input = @_[0];
# for debugging purposes, print out the input value
# printf ( "input is $input\n" );
# scrub char value into a boolean
if ( $input eq "G" )
{
$output = 0;
}
elsif ( $input eq "X" )
{
die "unconsumated record found in database\n";
}
else
{
$output = 1;
}
# for debugging purposes, print out the output value
# printf ( "output is $output\n" );
return $output;
}
</mapping-function>
<attribute-descriptor name='derived attribute 0' use-as='continuous' data-type='real'>
<attribute-descriptor name='derived attribute 1' use-as='continuous' data-type='real'>
<attribute-descriptor name='derived attribute 2' use-as='continuous' data-type='real'>
</data-schema>
<model-weights>
<weight model-name='simple tree' weight='1'>
</model-weights>
</header>
<model name='simple tree' training-data-name='./data/train' training-data-size='50000'>
<cart-model type='binary-classification' attribute-predicted='Fraud Indicator' number-nodes='13' depth='3'>
<cart-node node-number='0' attribute-name='derived attribute 3' cut-value='-0.158746' left-child='1' right-child='6'>
<cart-node node-number='1' attribute-name='derived attribute 65' cut-value='-0.273673' left-child='2' right-child='3'>
<cart-leaf-node node-number='2' score='0'>
<cart-node node-number='3' attribute-name='derived attribute 19' cut-value='-0.641615' left-child='4' right-child='5'>
<cart-leaf-node node-number='4' score='0'>
<cart-leaf-node node-number='5' score='1'>
<cart-node node-number='6' attribute-name='derived attribute 17' cut-value='-0.082279' left-child='7' right-child='10'>
<cart-node node-number='7' attribute-name='sic code' cut-value='5967' left-child='8' right-child='9'>
<cart-leaf-node node-number='8' score='1'>
<cart-leaf-node node-number='9' score='0'>
<cart-node node-number='10' attribute-name='derived attribute 27' cut-value='0.583846' left-child='11' right-child='12'>
<cart-leaf-node node-number='11' score='0'>
<cart-leaf-node node-number='12' score='1'>
</cart-model>
</model>
</mif>
APPENDIX B: File Format for Ensemble of Models
In this Appendix we propose extensions to the PMML described above so that an ensemble of models built on the same data schema can be incorporated in a single PMML document. The only modification occurs in the HEADER block, where a tag called MODEL-WEIGHTS is introduced. The new HEADER block looks like
<!ELEMENT HEADER - O
(DATA-SCHEMA & CREATION-INFORMATION? & MODEL-WEIGHTS?) >
<!ELEMENT MODEL-WEIGHTS - - ( WEIGHT + ) >
Model weighting information must contain one WEIGHT element per model in the document. If the document contains only one model, this element may be omitted.
<!ELEMENT WEIGHT - O EMPTY >
<!ATTLIST WEIGHT
MODEL-NAME CDATA #REQUIRED
WEIGHT-VALUE CDATA #REQUIRED >
The WEIGHT element gives the weight assigned to the named model.
The MODEL block now contains one or more models, each conforming to its structure as defined in the Document Type Definition, for example:
<MODEL>
<CART-MODEL>
information pertinent to cart-model
</CART-MODEL>
…
<LINEAR-REGRESSION-MODEL>
information pertinent to linear regression
</LINEAR-REGRESSION-MODEL>
</MODEL>
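For a document carrying both of the models sketched above, the header's MODEL-WEIGHTS block would contain one WEIGHT per model. The model names and weight values below are purely illustrative:
<MODEL-WEIGHTS>
<WEIGHT MODEL-NAME='simple tree' WEIGHT-VALUE='0.6'>
<WEIGHT MODEL-NAME='regression 1' WEIGHT-VALUE='0.4'>
</MODEL-WEIGHTS>
The markup itself only records the weights; how they are combined across models (for example, by weighted voting or weighted averaging of model scores) is not fixed by this extension.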
APPENDIX C: File Format for Resource Description
The Structure of RIF Documents
A RIF document starts with a <!DOCTYPE> declaration followed by a RIF element containing one or more CLUSTER elements:
<!DOCTYPE RIF PUBLIC "-//LAC//DTD RIF 0.8">
<RIF>
<CLUSTER>
<DOMAIN>
… other cluster elements
</CLUSTER>
…
</RIF>
In practice, the RIF tag can be omitted from the markup. A RIF document should have at least one CLUSTER element and each CLUSTER element should have at least one DOMAIN element.
CLUSTER element
<!ELEMENT CLUSTER - O (DOMAIN+, DM-ENGINE*) >
<!ATTLIST CLUSTER
NAME CDATA #REQUIRED
POINT-OF-ENTRY CDATA #REQUIRED >
The CLUSTER element represents one cluster and contains at least one DOMAIN element and zero or more data mining engines present at that cluster, represented by DM-ENGINE elements. The NAME attribute specifies the global name of the cluster and POINT-OF-ENTRY specifies the cluster point of entry.
DOMAIN element
<!ELEMENT DOMAIN - O (DATASET+) >
<!ATTLIST DOMAIN
NAME CDATA #REQUIRED >
This represents the domain of application, such as medical or physics data, as given by the NAME attribute. Each domain consists of one or more DATASET elements.
<!ELEMENT DATASET - O ((TABLE, MODEL*)+) >
<!ATTLIST DATASET
NAME CDATA #REQUIRED >
Each dataset consists of one or more tables and the associated predictive models. The NAME attribute specifies the global name of the dataset.
<!ELEMENT TABLE - O (SCHEMA) >
<!ATTLIST TABLE
NAME CDATA #REQUIRED >
The TABLE element contains the schema for the table. The NAME attribute specifies the name of the table.
<!ELEMENT SCHEMA - O (ATTRIBUTE-DESCRIPTOR, ATTRIBUTE-DESCRIPTOR+) >
The data schema of a table is denoted using a set of attribute descriptors. There should be at least two attributes in each table for any data discovery purposes.
<!ELEMENT ATTRIBUTE-DESCRIPTOR - O ( MAPPING-FUNCTION? ) >
<!ATTLIST ATTRIBUTE-DESCRIPTOR
NAME CDATA #REQUIRED
USE-AS (exclude | continuous | category | binary-category) #REQUIRED
DATA-TYPE ( real | integer | boolean | string ) #REQUIRED
>
It describes a single attribute of the data schema and can contain at most one mapping function. NAME specifies the name of the attribute, USE-AS specifies how the attribute is used in the data mining process and DATA-TYPE specifies the way the attribute is stored in the database.
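For example, an attribute that needs scrubbing before use could carry its mapping function inline; the attribute name here is illustrative, and the TYPE attribute follows the usage in the sample MIF document above rather than anything declared in this fragment of the DTD:
<ATTRIBUTE-DESCRIPTOR NAME='fraud indicator' USE-AS='binary-category' DATA-TYPE='string'>
<MAPPING-FUNCTION TYPE='perl'>
... a scrubbing routine such as the Char_Match_indicator function shown earlier ...
</MAPPING-FUNCTION>
</ATTRIBUTE-DESCRIPTOR>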
<!ELEMENT MODEL - O (#PCDATA) >
<!ATTLIST MODEL
NAME CDATA #REQUIRED
TYPE CDATA #REQUIRED >
This represents a predictive model built on the associated data schema. The NAME attribute specifies the global name of the model and TYPE specifies the type of the predictive model, e.g., C4.5 or CART for decision trees.
DM-ENGINE element
<!ELEMENT DM-ENGINE - O (HEADER,MODEL-TYPE+) >
DM-ENGINE represents the data discovery tools present at a cluster. Each data mining engine consists of some HEADER information and one or more MODEL-TYPE elements describing the nature of its discovery tools; a DM-ENGINE can therefore offer more than one type of data discovery tool.
<!ELEMENT HEADER - O (#PCDATA) >
<!ATTLIST HEADER
NAME CDATA #IMPLIED
VERSION CDATA #IMPLIED
COPYRIGHT CDATA #IMPLIED
TIMESTAMP CDATA #IMPLIED
>
It contains optional information about the data mining engine, such as the name of the implementor, the version number, copyright details and a timestamp.
<!ELEMENT MODEL-TYPE - O (#PCDATA) >
MODEL-TYPE represents the name of a data discovery tool like C4.5 or CART.
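Putting these elements together, a small RIF document describing a single cluster might look like the following. The cluster name, point of entry, dataset, attribute names, model and engine details are all invented for illustration:
<!DOCTYPE RIF PUBLIC "-//LAC//DTD RIF 0.8">
<RIF>
<CLUSTER NAME='cluster-a' POINT-OF-ENTRY='host-a.example.edu'>
<DOMAIN NAME='Medical'>
<DATASET NAME='patient-records'>
<TABLE NAME='admissions'>
<SCHEMA>
<ATTRIBUTE-DESCRIPTOR NAME='length of stay' USE-AS='continuous' DATA-TYPE='integer'>
<ATTRIBUTE-DESCRIPTOR NAME='outcome' USE-AS='category' DATA-TYPE='integer'>
</SCHEMA>
</TABLE>
<MODEL NAME='admissions-tree' TYPE='CART'>
</DATASET>
</DOMAIN>
<DM-ENGINE>
<HEADER NAME='Papyrus' VERSION='0.8'>
<MODEL-TYPE>CART</MODEL-TYPE>
</DM-ENGINE>
</CLUSTER>
</RIF>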
REFERENCES
ANSI X3H4 (1985). American National Standard Information Resource Dictionary System: Part 1-Core Standard. ANSI X3H4, American National Standards Institute, New York, 1985.
Ashok R. (1998). Incorporating transportable software agents into a wide area high performance distributed data mining system. Thesis report, Department of Electrical Engineering and Computer Science, University of Illinois at Chicago, 1998.
Breiman L. (1996). Bagging Predictors. Machine Learning, Volume 24, Number 2, pages 123-140.
Breiman L., Friedman J., Olshen R. and Stone C. (1984). CART: Classification and Regression Trees. Belmont, California, Wadsworth.
BNIF (1997). Proposal for a Bayesian Network Interchange Format. Decision Theory Group, Microsoft Research, 1996.
Buneman P., Davidson S., Grossman R. and Maier D. (1995). Interoperating with Non-Database Data. University of Pennsylvania, Computer Science Department Technical Report, 1995.
Chan P. and Stolfo S. (1995). A Comparative Evaluation of Voting and Meta-Learning on Partitioned Data. Proceedings of the Twelfth International Conference on Machine Learning, pages 90-98.
Chen M.S., Han J. and Yu P.S. (1996). Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Volume 8, Number 6, 1996.
Crawley S., Davis S., Indulska J., McBride S., and Raymond K. (1997). Meta Information Management. Formal Methods for Open Object-based Distributed Systems (FMOODS) Conference, Canterbury, UK, July 1997.
Domingos P. (1997). Why Does Bagging Work? A Bayesian Account and its Implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 155-158.
Drucker H. (1997). Fast Committee Machines for Regression and Classification. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 159-162.
Gray J. (1996). Data Management: Past, Present, and Future. IEEE Computer’s 50th Anniversary issue.
Gray R., Rus D. and Kotz D. (1996). Transportable Information Agents. Department of Computer Science, Dartmouth College, Technical Report, 1996.
Gray R. (1996). Agent Tcl. PhD Thesis, Department of Computer Science, Dartmouth College, 1996.
Grossman R.L., Hanley D. and Qin X. (1994). Ptool: A Light Weight Persistent Object Manager. Proceedings of SIGMOD 94, ACM 1994, pp 510.
Grossman R.L., Bodek H., Northcutt and Poor V. (1996). Data Mining and Tree-based Optimization. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Simoudis E., Han J. and Fayyad U., editors, AAAI Press, Menlo Park, 1996, pp 323-326.
Grossman R.L. (1997a). Class lectures for MISI Workshop. University of Illinois at Chicago, Spring 1997.
Grossman R.L. (1997b). Model Interchange Format for Predictive Models. Magnify Press Release, July 1997.
Grossman R.L. (1998). Data Mining: Two Cultures. Magnify Technical Report 98-1, submitted for publication.
Grossman R.L. and Cubed M. (1998). A Framework for Integrating Data Management, Data Mining and Predictive Modeling within the Knowledge Discovery Process. Magnify Technical Report 98-2, submitted for publication.
Grossman R.L., Bailey S., Pulleyn I., Hallstrom P., Qin X., Ramu A. and Malhi B. (1998a). The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML). Submitted for publication.
Hsu C., Bouziane M., Rattner L. and Yee L. (1991). Information Resources Management in Heterogeneous, Distributed Environments: A Metadatabase Approach. IEEE Transactions on Software Engineering, 1991.
Kargupta H., Hamzaoglu I. and Stafford B. (1997). Scalable Distributed Data Mining – An Agent Architecture. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 211-214.
Jaap J.P. and Davis E.K. (1994). A Format for the Interchange of Scheduling Models. 3rd International Symposium on Artificial Intelligence, Robotics, and Automation for Space, Jet Propulsion Laboratory, Pasadena, 1994.
Maxwell T. (1997). A Meta-Model Approach to Modular Simulation. Technical Report, Institute for Ecological Economics, University of Maryland, 1997.
MDC (1997). The Meta Data Interchange Specification Initiative: Goals and Charter. Meta Data Coalition, http://www.he.net/~metadata, June 1997.
Mon D. (1997). Data Warehouses, Data Marts and Data Mining. Class Lectures for HIS course, University of Illinois at Chicago, Fall 1997.
Ousterhout J.K. (1994). Tcl and the Tk Toolkit. Addison-Wesley, Reading, Massachusetts, 1994.
Parsaye K. (1996). DataMines for DataWarehouses. Database Programming and Design, September 1996.
Parsaye K. (1997a). OLAP and Data Mining: Bridging the Gap. Database Programming and Design, February 1997.
Parsaye K. (1997b). A Characterization of Data Mining Technologies and Processes. White Paper, Information Discovery, Inc., 1997.
PMML (1998). The Predictive Model Markup Language Version 0.9, The Data Mining Group.
PowerSim (1997). The PowerSim Corp. page. http://www.powersim.no., 1997.
Provost F. (1997). Scaling Up Inductive Algorithms: An Overview. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 239-242.
Puller D., and Stock K. (1997). Geospatial Modeling: A Case Study for a Statewide Land Information Strategy. Technical Report, The University of Queensland, 1997.
Quinlan J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc.
Raftery A.E., Madigan D. and Hoeting J.A. (1996). Bayesian Model Averaging for Linear Regression Models. Submitted for publication.
Ramasamy K. (1998). Object Relational Data Warehouse. Masters Project report, Laboratory for Advanced Computing, University of Illinois at Chicago, 1998.
SGML (1986). Information Processing – Text and Office Systems – Standard Generalized Markup Language (SGML). ISO 8879:1986.
Shekhar S., Coyle M., Goyal B., Liu D. and Sarkar S. (1997). Data Models in Geographic Information Systems. Communications of the ACM 40(4), 103-111.
Stolfo S., Prodromidis A., Tselepis S., Lee W. and Fan D. (1997). JAM: Java Agents for Meta-Learning over Distributed Databases. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, California, August, 1997, pp 74-81.
Tanenbaum A. (1996). Computer Networks, Third Edition. Prentice Hall Inc., New Jersey, 1996.
Tukey J. (1973). Exploratory Data Analysis. New York: McMillan, 1973.
W3 XML (1997). Working Draft Specification for XML. http://www.w3.org/pub/WWW/TR/WD-xml-lang-970331.html.
Wolpert D. (1992). Stacked Generalization. Neural Networks, Volume 5, pages 241-259.