
PMML 1.0 -- Overview
What Is PMML?
A PMML document provides a
non-procedural definition of fully trained or
parameterized analytic models, with sufficient information
for an application to deploy them. By parsing the PMML
with any standard XML parser, the application can
determine the types of data input to and output from the
models, the detailed forms of the models, and how to
interpret their results in terms of standard data mining
terminology.
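As a rough illustration of the parsing step described above, the following Python sketch reads a small PMML-like document with the standard library XML parser and lists its input fields and model kinds. The element and attribute names (`pmml-sketch`, `dictionary`, `field`, `model`, `kind`, `target`) are hypothetical placeholders chosen for this example; they are not taken from the PMML 1.0 DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML-like document. Element and attribute names are
# illustrative assumptions, not the names defined by the PMML 1.0 DTDs.
SAMPLE = """
<pmml-sketch>
  <dictionary>
    <field name="age" type="continuous"/>
    <field name="region" type="categorical"/>
  </dictionary>
  <model kind="decision-tree" target="region"/>
</pmml-sketch>
"""

def describe(document: str) -> dict:
    """Return the field types and model kinds found in the document."""
    root = ET.fromstring(document)
    # Map each declared field name to its declared type.
    fields = {f.get("name"): f.get("type") for f in root.iter("field")}
    # Collect the kind of every model element in document order.
    models = [m.get("kind") for m in root.iter("model")]
    return {"fields": fields, "models": models}

summary = describe(SAMPLE)
print(summary)
```

An application would use the same kind of traversal against the real DTD-defined elements to learn what inputs a model expects before deploying it.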
Detailed forms of models will vary
according to model type, but all are complete
textual definitions. In parsed form they provide enough
information for another program to generate executable
code or to execute the model interpretively by walking
the parse tree.
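The idea of parse-tree driven interpretive execution can be sketched for a decision tree as follows. This is a minimal sketch under stated assumptions: the XML layout (nested `node` elements carrying `field`, `threshold`, and `score` attributes, with the first child taken when the field value is below the threshold) is invented for illustration and does not reproduce the PMML 1.0 decision tree DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree encoding, assumed for this sketch only:
#   - an internal node has "field" and "threshold" and exactly two children
#   - the first child is taken when record[field] < threshold
#   - a leaf node has a "score" attribute
TREE = """
<node field="age" threshold="30">
  <node score="low"/>
  <node field="income" threshold="50000">
    <node score="medium"/>
    <node score="high"/>
  </node>
</node>
"""

def evaluate(node, record):
    """Walk the parsed tree and return the score of the leaf reached."""
    score = node.get("score")
    if score is not None:
        return score  # leaf: nothing further to test
    below, at_or_above = list(node)
    value = record[node.get("field")]
    threshold = float(node.get("threshold"))
    branch = below if value < threshold else at_or_above
    return evaluate(branch, record)

root = ET.fromstring(TREE)
print(evaluate(root, {"age": 40, "income": 60000}))
```

The interpreter never compiles the model; it simply follows the parsed structure for each record, which is the second of the two deployment routes mentioned above.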
Version 1.0 of the standard provides a
small set of DTDs that specify the entities and
attributes for documenting decision tree and multinomial
logistic regression models. This is by no means a
comprehensive set, and our expectation is that this
standard will evolve very rapidly to cover a robust
collection of model types. The purpose of publishing this
limited set is to demonstrate the fundamentals of PMML
with a realistic and useful "initial value" of
what will emerge as a comprehensive and rich collection
of modeling capabilities.
Version 1.0 DTDs follow a common
pattern of combining a data dictionary with one or more
model definitions to which that dictionary immediately
applies. As you will see, our dictionary elements are
very primitive. We anticipate and look forward to
subsequent versions of this standard introducing
optimizations, such as bit vector expansions of
categorical fields or log transforms of continuous
fields, but we believe that before such optimizations can
be included it is necessary to agree on minimally
sufficient infrastructure. We also expect to provide
definitions based on XML Schema, once the XML Schema
specifications become formal W3C recommendations.
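For readers unfamiliar with the two field transformations named above, here is a small Python sketch of what they compute. The function names and the sample category list are assumptions made for this example; PMML 1.0 itself does not define these transformations.

```python
import math

def bit_vector(value, categories):
    """Expand a categorical value into a 0/1 vector, one slot per category."""
    return [1 if value == category else 0 for category in categories]

def log_transform(x):
    """Natural-log transform of a positive continuous value."""
    if x <= 0:
        raise ValueError("log transform requires a positive value")
    return math.log(x)

# Hypothetical category list for a "region" field.
REGIONS = ["east", "west", "north"]

print(bit_vector("west", REGIONS))
print(log_transform(100.0))
```

A later version of the standard could record such transformations in the dictionary so that a consuming application applies them before evaluating the model.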
Why PMML?
One major goal of PMML is to allow
applications and on-line analytic processing tools to
use models obtained from multiple sources without having to
deal with individual differences between those sources.
Another goal is to enable combined, collaborative use of
a potentially very large number of individual models and
proactive administration of collections of models based
on business needs as well as mathematical principles. We
believe these capabilities are fundamental to effective
deployment of analytic models in commercial application
domains. PMML, or something very like it, is urgently
needed to satisfy dramatically increased requirements for
statistical and data mining tools and technologies in
business systems.
PMML Strategy
PMML Version 1.0 has been developed by
a loose affiliation of Angoss, Magnify, NCR, SPSS, and
the University of Illinois at Chicago. Our strategy is to
turn this activity into a W3C working group and have PMML
become a W3C recommendation. As part of W3C affiliation
we expect to increase group membership to include other
major players in the data mining tools and applications
space.