[This local archive copy is from the official and canonical URL, http://www.dmg.org/public/software/ncdm/pmml/PmmlDoc.html; please refer to the canonical source document if possible.]
PMML 0.9
DISCLAIMER: PMML 0.9 is intended as a "proof of concept" specification and implementation and is not intended to be a formal PMML standard document.
What is PMML 0.9?
Predictive Model Markup Language (PMML) is used for describing the structure and intent of data mining models. PMML is a simple markup language that uses XML as its meta-language, in a manner similar to the way Hypertext Markup Language (HTML) uses SGML as its meta-language. PMML provides semantically expressive descriptions of data mining models, from which different predictive models can be built.
PMML 0.9 is a "proof of concept" specification of the PMML language. This specification defines the intended interpretation of PMML 0.9 elements and places further constraints on the permitted syntax that are otherwise inexpressible in the DTD. A PMML 0.9 document is stored in a file with a .pmml extension. The DTD for PMML 0.9 is stored in a file with a .dtd extension.
Why PMML?
In recent years a variety of predictive models have been developed within the data mining community. There is also significant interest in comparing and evaluating the different models. PMML addresses the problem of interchanging predictive models and of performing ensemble and distributed learning. A text-based markup language for predictive models enables easy analysis and comparison, and also eliminates the need for binary compatibility between the platforms where these predictive models were built. PMML also permits the addition of a new model simply by accommodating it in the DTD. It provides a flexible mechanism for defining schemas for predictive models and supports model selection and model averaging when multiple predictive models are involved. In addition, it facilitates moving models across applications and systems.
The PMML 0.9 Specification
A PMML 0.9 document consists of several parts:
1) Header
2) Data Schema
3) Data Mining Schema
4) Predictive Model Schema
5) Definitions for Predictive Models
6) Definitions for Ensembles of Models
7) Rules for Selecting and Combining Models and Ensembles of Models
8) Rules for Exception Handling
Among the above components, the definitions for predictive models (component 5) are mandatory. In addition, a schema for the predictive model must be defined. This can be done using one or more of the schema components (components 2, 3 and 4). All the other components are optional.
<!ELEMENT HEADER - O (DATA-SCHEMA & CREATION-INFORMATION?) >
This contains the document header. The end tag for HEADER may always be omitted. The contents of the document header are an unordered collection of the following elements:
<!ELEMENT DATA-SCHEMA - -
(ATTRIBUTE-DESCRIPTOR, ATTRIBUTE-DESCRIPTOR+ ) >
Every PMML 0.9 document must have exactly one DATA-SCHEMA
element in the document's
HEADER. It provides the data-schema modeled by the given PMML
file. It must contain at least
two attribute descriptors, one being the predicted attribute.
ATTRIBUTE-DESCRIPTOR Element
<!ELEMENT ATTRIBUTE-DESCRIPTOR - O ( MAPPING-FUNCTION? ) >
<!ATTLIST ATTRIBUTE-DESCRIPTOR
NAME CDATA #REQUIRED
USE-AS ( exclude | continuous | category | binary-category ) #REQUIRED
DATA-TYPE ( real | integer | boolean | string ) #REQUIRED >
It describes a single attribute of the data-schema. It can
contain at most one mapping function.
NAME specifies the name of the attribute, USE-AS specifies usage
of this attribute in the data
mining process and DATA-TYPE specifies the way the attribute is
stored in the database.
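As a rough illustration of the constraints above, the following Python sketch parses a small DATA-SCHEMA fragment and checks it against the declared attribute values. The fragment, the attribute names (age, income, buys) and the helper check_data_schema are assumptions made up for this example and are not part of PMML 0.9. End tags are written out explicitly so that an ordinary XML parser can read the fragment.

```python
import xml.etree.ElementTree as ET

# Allowed values copied from the ATTLIST for ATTRIBUTE-DESCRIPTOR.
USE_AS = {"exclude", "continuous", "category", "binary-category"}
DATA_TYPE = {"real", "integer", "boolean", "string"}

# Hypothetical fragment; the attribute names are invented for illustration.
SAMPLE = """
<DATA-SCHEMA>
  <ATTRIBUTE-DESCRIPTOR NAME="age" USE-AS="continuous" DATA-TYPE="integer"/>
  <ATTRIBUTE-DESCRIPTOR NAME="income" USE-AS="continuous" DATA-TYPE="real"/>
  <ATTRIBUTE-DESCRIPTOR NAME="buys" USE-AS="binary-category" DATA-TYPE="boolean"/>
</DATA-SCHEMA>
"""

def check_data_schema(text):
    """Return a list of problems found in a DATA-SCHEMA fragment."""
    root = ET.fromstring(text)
    problems = []
    descriptors = root.findall("ATTRIBUTE-DESCRIPTOR")
    # The content model requires at least two attribute descriptors.
    if len(descriptors) < 2:
        problems.append("fewer than two ATTRIBUTE-DESCRIPTOR elements")
    for d in descriptors:
        name = d.get("NAME")
        if name is None:
            problems.append("descriptor without NAME")
        if d.get("USE-AS") not in USE_AS:
            problems.append(f"{name}: bad USE-AS {d.get('USE-AS')!r}")
        if d.get("DATA-TYPE") not in DATA_TYPE:
            problems.append(f"{name}: bad DATA-TYPE {d.get('DATA-TYPE')!r}")
        # At most one mapping function is permitted per descriptor.
        if len(d.findall("MAPPING-FUNCTION")) > 1:
            problems.append(f"{name}: more than one MAPPING-FUNCTION")
    return problems

print(check_data_schema(SAMPLE))
```

An empty list means that none of the checked constraints was violated. The sketch does not check which descriptor is the predicted attribute, since the DATA-SCHEMA declaration alone does not mark it.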
MAPPING-FUNCTION
<!ELEMENT MAPPING-FUNCTION - - CDATA >
<!ATTLIST MAPPING-FUNCTION TYPE CDATA #REQUIRED>
The mapping function describes the transformation to be
performed on the attribute. The TYPE
attribute indicates the language in which the mapping function is
written.
CREATION-INFORMATION
<!ELEMENT CREATION-INFORMATION - -
( COPYRIGHT? & APPLICATION? & INDIVIDUAL? & TIMESTAMP? ) >
This is the information about how, when and by whom the model
was created. It is optional and the
tags in this sub tree are self-explanatory.
<!ELEMENT MODEL - O
( CREATION-INFORMATION?, (CART-MODEL | REGRESSION-MODEL | ID3-MODEL ) ) >
<!ATTLIST MODEL
NAME CDATA #REQUIRED
TYPE (CART | C4.5 | OC-1) #REQUIRED
TRAINING-DATA-NAME CDATA #IMPLIED
TRAINING-DATA-SIZE NUMBER #IMPLIED >
This contains the model specific part of the document. The end
tag for MODEL may be omitted.
The key attributes are the model name and the model type.
CREATION-INFORMATION
This element is the same as the one described for the HEADER
block.
This block describes the details of a particular type of
predictive model. Here we present the PMML
for the models we support.
1. C4.5 Model
C45-MODEL
<!ELEMENT C45-MODEL - - ( (C45-NODE | C45-LEAF-NODE)+ ) >
<!ATTLIST C45-MODEL
...
>
C45-NODE
<!ELEMENT C45-NODE - O EMPTY >
<!ATTLIST C45-NODE
... >
C45-LEAF-NODE
<!ELEMENT C45-LEAF-NODE - O EMPTY>
<!ATTLIST C45-LEAF-NODE ... >
2. CART Model
CART-MODEL
<!ELEMENT CART-MODEL - - ( (CART-NODE | CART-LEAF-NODE)+ ) >
<!ATTLIST CART-MODEL
TYPE ( binary-classification | classification | regression ) #REQUIRED
ATTRIBUTE-PREDICTED CDATA #REQUIRED
NUMBER-NODES NUMBER #REQUIRED
DEPTH NUMBER #REQUIRED >
It marks the beginning of a CART model. The attributes of this
tag include the TYPE of the CART
model, the attribute that is predicted using this model, the
number of nodes in the tree and the
depth of the tree.
CART-NODE
<!ELEMENT CART-NODE - O EMPTY >
<!ATTLIST CART-NODE
NODE-NUMBER NUMBER #REQUIRED
ATTRIBUTE-NAME CDATA #REQUIRED
LEFT-CHILD NUMBER #REQUIRED
RIGHT-CHILD NUMBER #REQUIRED
CUT-VALUE CDATA #REQUIRED >
This denotes a non-leaf node in the tree. Its attributes are
the node number, the attribute name
associated with the node, the node numbers of its left and right
children and the cut value.
CART-LEAF-NODE
<!ELEMENT CART-LEAF-NODE - O EMPTY>
<!ATTLIST CART-LEAF-NODE
NODE-NUMBER NUMBER #REQUIRED
SCORE CDATA #REQUIRED >
This denotes a leaf node in the tree. Its attributes are the
node number and the class value
associated with it.
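To show how the CART-NODE and CART-LEAF-NODE attributes fit together, here is a minimal Python sketch that scores one record by walking the tree from node to node. The tree, the attribute names (age, income) and the function score are hypothetical. The sketch also assumes that node 1 is the root and that values less than or equal to CUT-VALUE go to the left child; this specification states neither rule.

```python
# Hypothetical tree keyed by NODE-NUMBER. Internal nodes mirror the
# CART-NODE attributes, leaves mirror the CART-LEAF-NODE attributes.
NODES = {
    1: {"ATTRIBUTE-NAME": "age", "CUT-VALUE": "30",
        "LEFT-CHILD": 2, "RIGHT-CHILD": 3},
    2: {"SCORE": "no"},
    3: {"ATTRIBUTE-NAME": "income", "CUT-VALUE": "50000",
        "LEFT-CHILD": 4, "RIGHT-CHILD": 5},
    4: {"SCORE": "no"},
    5: {"SCORE": "yes"},
}

def score(nodes, record, root=1):
    """Follow child links from the root until a leaf is reached."""
    node = nodes[root]  # assumption: node 1 is the root
    while "SCORE" not in node:
        value = float(record[node["ATTRIBUTE-NAME"]])
        cut = float(node["CUT-VALUE"])  # CUT-VALUE is CDATA, so convert it
        # Assumption: value <= cut goes left, otherwise right.
        next_number = node["LEFT-CHILD"] if value <= cut else node["RIGHT-CHILD"]
        node = nodes[next_number]
    return node["SCORE"]

print(score(NODES, {"age": 42, "income": 72000}))
```

Because nodes refer to each other by number rather than by nesting, the tree can be stored as a flat table, which matches the flat list of EMPTY elements in the CART-MODEL content model.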
3. ID3 Model
ID3-MODEL
<!ELEMENT ID3-MODEL - - ( (ID3-NODE | ID3-LEAF-NODE)+ ) >
<!ATTLIST ID3-MODEL
ATTRIBUTE-PREDICTED CDATA #REQUIRED
NUMBER-NODES NUMBER #REQUIRED
DEPTH NUMBER #REQUIRED >
It marks the beginning of an ID3 model. The attributes of this
tag include the attribute that is
predicted using this model, the number of nodes in the tree and
the depth of the tree.
ID3-NODE
<!ELEMENT ID3-NODE - O EMPTY >
<!ATTLIST ID3-NODE
NODE-NUMBER NUMBER #REQUIRED
ATTRIBUTE-NAME CDATA #REQUIRED
CUT-VALUE CDATA #REQUIRED
LEFT-CHILD NUMBER #REQUIRED
RIGHT-SIBLING NUMBER #REQUIRED >
This denotes a non-leaf node in the tree. Its attributes are
the node number, the attribute name
associated with the node, the node numbers of its left child and
right sibling and the cut value.
ID3-LEAF-NODE
<!ELEMENT ID3-LEAF-NODE - O EMPTY>
<!ATTLIST ID3-LEAF-NODE
NODE-NUMBER NUMBER #REQUIRED
CUT-VALUE CDATA #REQUIRED
SCORE CDATA #REQUIRED
RIGHT-SIBLING NUMBER #REQUIRED >
This denotes a leaf node in the tree. Its attributes are the
node number, the cut value, the class value associated with the node,
and the node number of its right sibling.
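The LEFT-CHILD and RIGHT-SIBLING attributes suggest the usual left-child / right-sibling encoding of a multiway tree. The Python sketch below shows one possible reading of it. The tree, the attribute names (outlook, windy) and the functions children and classify are hypothetical. Two further assumptions are made that this specification does not confirm: the number 0 means "no such node", and a child's CUT-VALUE holds the value of the parent's attribute that selects that child.

```python
# Hypothetical ID3 tree keyed by NODE-NUMBER.
# Assumption: 0 means "no right sibling"; no sentinel is defined in the spec.
NODES = {
    1: {"ATTRIBUTE-NAME": "outlook", "CUT-VALUE": "",
        "LEFT-CHILD": 2, "RIGHT-SIBLING": 0},
    2: {"CUT-VALUE": "sunny", "SCORE": "no", "RIGHT-SIBLING": 3},
    3: {"CUT-VALUE": "overcast", "SCORE": "yes", "RIGHT-SIBLING": 4},
    4: {"ATTRIBUTE-NAME": "windy", "CUT-VALUE": "rain",
        "LEFT-CHILD": 5, "RIGHT-SIBLING": 0},
    5: {"CUT-VALUE": "true", "SCORE": "no", "RIGHT-SIBLING": 6},
    6: {"CUT-VALUE": "false", "SCORE": "yes", "RIGHT-SIBLING": 0},
}

def children(nodes, number):
    """Yield the child node numbers of a node, left to right."""
    child = nodes[number].get("LEFT-CHILD", 0)
    while child != 0:
        yield child
        child = nodes[child]["RIGHT-SIBLING"]

def classify(nodes, record, root=1):
    """Descend from the root, choosing the child whose CUT-VALUE matches."""
    number = root
    while "SCORE" not in nodes[number]:
        value = record[nodes[number]["ATTRIBUTE-NAME"]]
        # Assumption: a child's CUT-VALUE is the parent attribute value
        # that selects it.
        for child in children(nodes, number):
            if nodes[child]["CUT-VALUE"] == value:
                number = child
                break
        else:
            return None  # no branch matches this value
    return nodes[number]["SCORE"]

print(list(children(NODES, 1)))
print(classify(NODES, {"outlook": "rain", "windy": "false"}))
```

With this encoding every node needs only two links regardless of how many branches it has, which is why a fixed attribute list can describe a tree whose nodes have varying numbers of children.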
4. Linear Regression Model
LINEAR-REGRESSION-MODEL
<!ELEMENT LINEAR-REGRESSION-MODEL - O ( LINEAR-REGRESSION-COEFFICIENT )+ >
<!ATTLIST LINEAR-REGRESSION-MODEL
DIMENSION CDATA #REQUIRED >
It marks the beginning of a linear regression model. The
attribute of this tag is the dimension of the
model.
LINEAR-REGRESSION-COEFFICIENT
<!ELEMENT LINEAR-REGRESSION-COEFFICIENT - - EMPTY>
<!ATTLIST LINEAR-REGRESSION-COEFFICIENT
COEFFICIENT-POSITION CDATA #REQUIRED
COEFFICIENT-VALUE CDATA #REQUIRED >
This tag gives the position and the coefficient value for that
position in the linear regression model.
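As a minimal sketch of how such coefficients might be applied, the Python code below computes a prediction as a weighted sum of the inputs. The coefficient values and the function predict are hypothetical. The sketch assumes that positions start at 0, that the coefficient at position i multiplies the i-th input, and that there is no separate intercept term; this specification does not define any of these conventions.

```python
# Hypothetical coefficients keyed by COEFFICIENT-POSITION.
# Both attributes are CDATA, so they arrive as strings.
COEFFICIENTS = {"0": "1.5", "1": "-2.0", "2": "0.25"}

def predict(coefficients, inputs):
    """Return the sum of coefficient[i] * inputs[i] over all positions."""
    # Assumption: DIMENSION equals the number of coefficients and inputs.
    if len(coefficients) != len(inputs):
        raise ValueError("dimension mismatch")
    # Assumption: position i pairs with inputs[i]; no intercept is added.
    return sum(float(coefficients[str(i)]) * x for i, x in enumerate(inputs))

print(predict(COEFFICIENTS, [2.0, 1.0, 4.0]))
```

If a model does carry an intercept, one common convention is to store it at a fixed position and pair it with a constant input of 1; whether PMML 0.9 intends this is not stated here.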