SourceForge developers have issued two recent updates to Version 3 of the Predictive Model Markup Language (PMML). Considered to be the most widely deployed data mining standard, PMML is an XML markup language used to describe statistical and data mining models.
PMML is formally defined using the W3C XML Schema language. It "describes the inputs to data mining models, the transformations used prior to prepare data for data mining, and the parameters which define the models themselves. PMML is used for a wide variety of applications, including applications in finance, e-business, direct marketing, manufacturing, and defense. PMML is complementary to several other data mining standards: its XML interchange format is supported by XML for Analysis (XMLA), JSR 73, and 'SQL/MM Part 6: Data Mining'."
As of PMML Version 3.0.2, the specification is said to represent a mature standard, such that deployment through the creation of PMML scoring engines is now straightforward. For PMML Version 3.1 and later, the development team will continue to add new statistical and data mining models, reducing the need to use the approved extension mechanisms. They also plan to enhance support for data preparation, which is still a labor-intensive task for some applications.
PMML specification development has been advanced for several years by the independent, vendor-led Data Mining Group (DMG), though end-user companies are now showing heightened interest. DMG full members as of 2005-04 included IBM Corp; KXEN; Magnify Inc; Microsoft; MicroStrategy Inc.; National Center for Data Mining, University of Illinois at Chicago; Oracle Corporation; Prudential Systems Software; Salford Systems; SAS Inc; SPSS Inc; StatSoft, Inc. Associate members include Angoss Software Corp; Insightful Corp; NCR Corp; Quadstone; Urban Science; SAP. Support for PMML in software products is provided by several of these members, and by others who want an XML interchange format for statistical and data mining models.
According to a published "Overview of PMML Version 3.0" by Stefan Raspl (IBM), "PMML is an application and system independent interchange format for statistical and data mining models. More precisely, the goal of PMML is to encapsulate a model in an application and system independent fashion so that two different applications (the PMML Producer and Consumer) can use it. PMML Version 3.0 adds the ability to compose certain data mining operations. For example, the outputs of regression models can be used as the inputs to other models (model sequencing) and a decision tree or regression model can be used to combine the outputs of several embedded models (model selection)."
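To make the producer/consumer split concrete, the following is a minimal sketch of a PMML consumer, written in Python under stated assumptions. The embedded document is a hand-written toy model, not one taken from the specification or the DMG samples; the field names, coefficients, and the `score` helper are hypothetical, and the namespace URI is assumed to be the one used for PMML 3.0. Only a single regression table with numeric predictors is handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy model, as a producer might export it (illustration only).
PMML_DOC = """<PMML version="3.0" xmlns="http://www.dmg.org/PMML-3_0">
  <Header description="toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Assumed namespace for PMML 3.0; it matches the toy document above.
NS = {"p": "http://www.dmg.org/PMML-3_0"}


def score(pmml_text, record):
    """Consumer side: read the model parameters and apply them to one record."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    total = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        name = predictor.get("name")
        total += float(predictor.get("coefficient")) * record[name]
    return total


if __name__ == "__main__":
    print(score(PMML_DOC, {"age": 40.0, "income": 50000.0}))
```

The consumer never needs to know which tool produced the model; it reads only the declared fields and parameters, which is the point of the interchange format.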
Three new models in PMML Version 3 include rule sets, support vector machines, and text models. "Ruleset models can be thought of as flattened decision tree models, but cover areas where decision trees are not handy or are too limited. Rulesets can be applied to new instances to derive predictions and associated confidences (scoring). Support vector machines define hyperplanes, which try to separate the values of a given target field. The hyperplanes are defined using kernel functions. The most popular kernel types are supported: linear, polynomial, radial basis and sigmoid; they can be used for both classification and regression."
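The four kernel types named above can be sketched in Python as follows. These are the common textbook forms; the parameter names (`gamma`, `coef0`, `degree`) and the `decision_value` helper are assumptions chosen for illustration and are not copied from the PMML schema.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=1.0, degree=2):
    return (gamma * dot(x, y) + coef0) ** degree


def radial_basis_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


def decision_value(x, support_vectors, coefficients, bias, kernel):
    """Weighted sum of kernel evaluations against the support vectors, plus a bias.

    For classification the sign of this value selects the side of the
    hyperplane; for regression the value itself is the prediction.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefficients)) + bias


if __name__ == "__main__":
    svs = [[1.0, 0.0], [0.0, 1.0]]
    print(decision_value([2.0, 3.0], svs, [1.0, -1.0], 0.5, linear_kernel))
```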
The PMML Version 3 text model consists of the following components: (1) text dictionary that contains the terms in the model; (2) corpus of text documents which identifies the actual texts that are covered by a model; (3) document-term matrix that specifies which terms are used in which document; (4) text model normalization element defining one of several possible normalizations of the document term matrix; (5) text model similarity element to define the similarity used to compare two vectors representing documents.
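How these components fit together can be sketched in a few lines of Python. The dictionary, the three short documents, and the choice of length normalization with cosine similarity are all assumptions made for illustration; PMML allows several normalizations and similarity measures, and this shows only one plausible combination.

```python
import math

# (1) Text dictionary: the terms the model knows about (hypothetical).
dictionary = ["data", "mining", "model", "markup"]

# (2) Corpus: the documents covered by the model (hypothetical).
corpus = {
    "doc1": "data mining model data",
    "doc2": "markup model",
    "doc3": "data mining",
}


def term_counts(text):
    """One row of the document-term matrix: counts of each dictionary term."""
    words = text.split()
    return [words.count(term) for term in dictionary]


# (3) Document-term matrix, one row per document.
matrix = {name: term_counts(text) for name, text in corpus.items()}


def normalize(vector):
    """(4) Normalization: scale a row to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def cosine_similarity(u, v):
    """(5) Similarity: dot product of the two normalized rows."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


if __name__ == "__main__":
    print(matrix["doc1"])
    print(cosine_similarity(matrix["doc1"], matrix["doc3"]))
```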
The PMML specification has undergone successive refinement since (at least) 1997; a version 0.7 developed by the National Center for Data Mining (NCDM) at the University of Illinois at Chicago was released in July 1997. A variety of PMML version 0.9 applications were demonstrated at Supercomputing 1998. Version 1.0 was developed by Angoss, Magnify, NCR, SPSS, and the National Center for Data Mining. IBM joined the effort in 1999; Microsoft and Oracle joined in 2000. PMML developers began to use SourceForge for PMML Version 2.1 schemas, documentation, and associated utilities in June 2002.
The KDD-2004 Online Proceedings volume notes that the August 2004 DM-SSP Workshop "marks the fourth year that there has been a KDD workshop on the Predictive Model Markup Language (PMML) and related areas and the second year of a broader conference with the theme of Data Mining Standards, Services and Platforms. One of the goals of PMML was to create a standard interface between producers of models, such as statistical or data mining systems, and consumers of models, such as scoring engines, applications containing embedded models, and other operational systems. There are now quite a few vendors shipping scoring engines, which is an important measure of success in this area. For the past several years, the developers of PMML have been working to create a similar mechanism so that the transformations and compositions required in the data processing, which are so essential to data mining, can be similarly encapsulated."
Principal references:
- PMML Version 3.0.2, May 19, 2005. Version 3.0.2 has only minor updates relative to 3.0.0, with 7 changed files: BuiltinFunctions, GeneralStructure, MiningSchema, ModelComposition, Sequence, Changes, [XSD Schema file]. Available from SourceForge. See the ZIP archive and file listing [source .ZIP, also TGZ]
- PMML Version 3.0:
- PMML 3.0: General Structure of a PMML Document
- PMML v3.0 XML Schema [cache]
- Changes in PMML 3.0 from PMML 2.1 "Improvements and additions relating to Association Rules, Builtin Functions, Clustering Model and Data Dictionary."
- PMML v3.0 Overview
- DMG's PMML web site:
- Data Mining Group Home Page
- DMG/PMML FAQ document
- About the DMG: Industry Support
- PMML Sample Models
- Products supporting PMML
- SourceForge Project Summary: Predictive Model Markup Language (PMML)
- PMML SourceForge discussion list
- See also: "Predictive Model Markup Language (PMML)" - Local reference page.
- Workshops:
- ACM SIGKDD 2005. "Data Mining Standards, Services and Platforms." August 21, 2005, Chicago, Illinois. Held in conjunction with The 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005). Maturing standards of interest: (1) Predictive Model Markup Language - PMML; (2) XML for Analysis and OLE DB for Data Mining; (3) SQL/MM Part 6: Data Mining; (4) Java Data Mining (JDM) - Java Specification Request [JSR] 73; (5) CRoss Industry Standard Process for Data Mining - CRISP-DM; (6) OMG Common Warehouse Metadata (CWM) for Data Mining. Contact Kurt Thearling.
- KDD-2004 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 04). Sunday, August 22, 2004, Seattle, WA. See the KDD-2004 online proceedings.
- KDD-2003 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 03). August 27, 2003, Washington, DC.
- Contact: Robert Grossman, Director of the Laboratory for Advanced Computing (LAC) and the National Center for Data Mining (NCDM) at the University of Illinois at Chicago.
- Related data mining specifications using PMML for interchange:
JSR 73 and JSR 247:
- JSR 247: Data Mining 2.0 "JDM 2.0 addresses requested features deferred from JDM 1.0, which focused on the data mining framework and a select number of mining functions and algorithms... Like JDM 1.0, JDM 2.0 will be based on a highly-generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such OMG's CWM, SQL/MM for Data Mining, and DMG's PMML. The JDM model will support four conceptual areas that are generally of key interest to users of data mining systems: settings, models, transformations, and results. The object model provides a core layer of services and interfaces that are available to all clients. Clients consistently see the same interfaces and semantics and are coded to these interfaces..."
- "Java Data Mining (JSR-73): Status and Overview." By Mark F. Hornick, Hankil Yoon, and Sunil Venkayala. Presented at KDD-2004 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 04), Sunday, August 22, 2004. "With the completion of Java Data Mining (JSR-73), customers and vendors now have available a powerful standard to enable applications with data mining, both through Java and Web services. In this paper, we introduce Java Data Mining with examples highlighting both the Java and Web services interfaces. We discuss conformance requirements using the Technology Compatibility Kit (TCK) for vendors implementing the standard. Lastly, we comment on likely features for the next release of JDM. The expert group is now forming for Java Data Mining 2.0 as the JCP Executive Committee approved JSR-247."
- JSR 73: Data Mining API. "Addressed the need for a pure Java API that supports the building of data mining models, the scoring of data using models, as well as the creation, storage, access and maintenance of data and metadata supporting data mining results, and select data transformations. Based on a highly-generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such OMG's CWM, SQL/MM for Data Mining, and DMG's PMML. The JDMAPI model supports four conceptual areas that are generally of key interest to users of data mining systems: settings, models, transformations, and results."
XML for Analysis (XMLA):
- Other references:
- Online Proceedings of the Second Annual Workshop on Data Mining Standards, Services and Platforms. KDD-2004 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 04). August 22, 2004, Seattle, WA, USA.
- "An Overview of PMML Version 3.0." By Stefan Raspl (IBM). Presented at KDD-2004 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 04), Sunday, August 22, 2004.
- "A Simple Strategy for Composing Data Mining Operations." By Robert L. Grossman (University of Illinois at Chicago and Open Data Partners), David Hanley (University of Illinois at Chicago), and Gregor Meyer (IBM). Presented at KDD-2004 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 04), Sunday, August 22, 2004.
- "PMML: Data Mining for the Masses? PMML Recasts the Data Warehouse as a Turnkey Platform for Real-Time Data Mining." By Stephen Swoyer. From Enterprise Systems (May 25, 2005).
- National Center for Data Mining
- IBM DB2 Intelligent Miner Tools. "The PMML standard allows organizations to develop data mining models by using familiar interfaces and the easy implementation of the models in DB2... All these tools use PMML as the cornerstone of all data mining model data interchange... DB2 Intelligent Miner Scoring can score actual records (stored in DB2 or Oracle databases) against PMML models from any source, such as DB2 Intelligent Miner Modeling, DB2 Intelligent Miner for Data, or any other PMML data mining tool. DB2 Intelligent Miner Visualization can read and visualize PMML models from DB2 Intelligent Miner Modeling, DB2 Intelligent Miner for Data or any other PMML data mining tool..."
- "Web Services Standards for Data Mining." By Robert Chu (SAS). Presented at KDD-2004 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 04), Sunday, August 22, 2004.
- DataSpace Project. "Dataspace is built on open protocols and standards. Queries are done using SOAP/XML and the metadata is generally in XML. Data mining is done using the Data Mining Group's (DMG) Predictive Model Markup Language (PMML)." From the Data Webs FAQ.