Regression Model
Last Modified: July 20 1999, 9:20PM

Contents:

1. Model DTD and tag description.
2. Sample exported XML.
3. Scoring algorithm.


1. Model DTD and tag description.
Note: the model below assumes that dictionary tags are defined elsewhere. Variables are referred to by name.
 

<!-- multinomial-regression scoring model.  -->
<!ENTITY % NUMBER "NMTOKEN">

<!ELEMENT regression-model (
    factor-list?,
    covariate-list?,
    predictor-to-parameter-correlation-matrix?,
    parameter-table)>
<!ATTLIST regression-model

model-id                        CDATA           #REQUIRED
response-variable-name          CDATA           #REQUIRED
number-parameters               %NUMBER;        #REQUIRED
model-type                      (regression | general-linear | log-linear | multinomial-logistic) #REQUIRED
verbose-model-specification     CDATA           #IMPLIED>


<!ELEMENT factor-list (var-name+)>
<!ELEMENT covariate-list (var-name+)>
<!ELEMENT var-name (#PCDATA)>

<!ELEMENT predictor-to-parameter-correlation-matrix (predictor-to-parameter-cell+)>
<!ELEMENT predictor-to-parameter-cell (#PCDATA)>
<!ATTLIST predictor-to-parameter-cell
    predictor-name                  CDATA           #REQUIRED
    parameter-name                  CDATA           #REQUIRED>

<!ELEMENT parameter-table (parameter-cell+)>
<!ELEMENT parameter-cell EMPTY>
<!ATTLIST parameter-cell
    target-category                CDATA           #REQUIRED
    parameter-name                 CDATA            #REQUIRED
    beta                           %NUMBER;         #REQUIRED
    std-error                      %NUMBER;         #IMPLIED
    df                             %NUMBER;         #IMPLIED>


regression-model - marks the beginning of a multinomial regression model.
factor-list - list of factor names. Omitted when the model has no factors (as in, for example, a linear regression model). Each name in the list must match a name from the dictionary. Factors are assumed to be categorical variables.
covariate-list - list of covariate names. Omitted when the model has no covariates. Each name in the list must match a name from the dictionary. Covariates are assumed to be continuous variables.
model-id - string that uniquely identifies the model. The only requirement is that no two models ever have the same id.
response-variable-name - name of the response variable. Must match a name from the dictionary.
number-parameters - number of parameters produced by the analysis. It is one dimension of the predictor-to-parameter-correlation-matrix and one dimension of the parameter-table. Although this number can be derived after reading either matrix, it is convenient to have it beforehand.
model-type - specifies the type of regression model in use. This information is used to select the appropriate mathematical formulas during scoring. The supported types are the ones enumerated in the DTD: regression, general-linear, log-linear, and multinomial-logistic.
verbose-model-specification - informational item that describes the model as it was specified by the user. Not required for classification.

predictor-to-parameter-correlation-matrix - exported only when the regression type (model-type) requires it for scoring. It can be thought of as a rectangular matrix with a column for each predictor (factor or covariate) and a row for each parameter. The matrix is represented as a sequence of cells.
  • For each predictor variable v and each parameter p, the corresponding cell is empty (missing) if there is no correlation between v and p. Empty cells are not exported with the model.
  • If there is a correlation between a covariate predictor and the parameter, the cell value is the exponent to which the covariate is raised in the dependency expression. Example: assuming variable jobcat is a factor and work is a covariate, the parameter [jobcat=professional] * work * work is correlated with the covariate work, and the cell value is 2 because work appears to the second power in the expression.
  • If there is a correlation between a factor predictor and the parameter, the cell value is the value of the factor variable that determines the correlation. Example: assuming the categories of the factor variable jobcat are professional, clerical, skilled, and unskilled, the cell that corresponds to (jobcat=skilled, jobcat) has the value skilled.
  • Any cell that is absent from the XML file when the model is parsed is assumed to be empty. Since empty cells make up most of the matrix, omitting them reduces the size of the exported model.
predictor-to-parameter-cell - cell in the predictor-to-parameter-correlation-matrix. It carries its row name (parameter-name), its column name (predictor-name), and the content described above.
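
The sparse representation described above can be sketched in Python as a dictionary that stores only the exported cells. This is a minimal illustration, not part of the specification: the names Cells and cell_value are hypothetical, and the parameter names are taken from the jobcat/work examples in the bullets.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory form of the matrix: only exported (non-empty) cells
# are stored, keyed by (parameter name, predictor name).
Cells = Dict[Tuple[str, str], str]

cells: Cells = {
    # Factor cell: the content is the category that determines the correlation.
    ("[jobcat=skilled]", "jobcat"): "skilled",
    ("[jobcat=professional] * work * work", "jobcat"): "professional",
    # Covariate cell: the content is the exponent of the covariate.
    ("[jobcat=professional] * work * work", "work"): "2",
}


def cell_value(cells: Cells, parameter: str, predictor: str) -> Optional[str]:
    """Return the cell content, or None when the cell was not exported (empty)."""
    return cells.get((parameter, predictor))


print(cell_value(cells, "[jobcat=skilled]", "jobcat"))  # skilled
print(cell_value(cells, "[jobcat=skilled]", "work"))    # None
```

Looking up a missing key and getting None mirrors the rule that cells absent from the file are treated as empty.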


parameter-table - table containing the parameter values along with associated statistics (standard error, degrees of freedom). One dimension holds the target variable's categories, the other holds the parameter names. The table is represented by specifying each cell. The only requirement on parameter names is that each name uniquely identifies one parameter.
parameter-cell - cell in the parameter-table. The target-category and parameter-name attributes determine the cell's location in the parameter table. The cell contains beta (the actual parameter value, required), std-error (standard error, optional), and df (degrees of freedom, optional).
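
As a rough sketch of how a reader of the model might hold one parameter-cell in memory, the following Python fragment keeps beta required and std-error and df optional, as in the DTD. The class ParameterCell and the helper cell_from_attributes are hypothetical names introduced for illustration, and the attribute values are illustrative.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ParameterCell:
    target_category: str
    parameter_name: str
    beta: float                        # required by the DTD
    std_error: Optional[float] = None  # optional (#IMPLIED)
    df: Optional[float] = None         # optional (#IMPLIED)


def cell_from_attributes(attrs: Mapping[str, str]) -> ParameterCell:
    """Build a cell from an attribute mapping; raises KeyError if a required attribute is absent."""
    def optional(name: str) -> Optional[float]:
        return float(attrs[name]) if name in attrs else None

    return ParameterCell(
        target_category=attrs["target-category"],
        parameter_name=attrs["parameter-name"],
        beta=float(attrs["beta"]),
        std_error=optional("std-error"),
        df=optional("df"),
    )


cell = cell_from_attributes(
    {"target-category": "1", "parameter-name": "AGE", "beta": "-.133", "std-error": ".086"}
)
print(cell.beta, cell.std_error, cell.df)
```

Python's float() accepts number spellings such as "-.133" and "7.885E-02", so no extra normalization is needed in this sketch.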


     

    2. Sample exported XML.

    Here is the information about the variables:
     
    Name      Type       Number of categories  Categories (numeric coding in parentheses)
    JOBCAT    Response   7                     Clerical (1), Office trainee (2), Security officer (3), College trainee (4), Exempt employee (5), MBA trainee (6), Technical (7)
    SEX       Factor     2                     Males (0), Females (1)
    MINORITY  Factor     2                     White (0), Nonwhite (1)
    AGE       Covariate
    WORK      Covariate

    The parameter estimates are displayed as follows:

    The predictor-to-parameter-correlation-matrix is:
    Parameter                  SEX  MINORITY  AGE  WORK
    Intercept                  .    .         .    .
    [SEX = 0]                  0    .         .    .
    [SEX = 1]                  1    .         .    .
    [MINORITY = 0]([SEX = 0])  0    0         .    .
    [MINORITY = 1]([SEX = 0])  0    1         .    .
    [MINORITY = 0]([SEX = 1])  1    0         .    .
    [MINORITY = 1]([SEX = 1])  1    1         .    .
    AGE                        .    .         1    .
    WORK                       .    .         .    1

    This mapping of predictors to parameters is the same for each response category.

    The corresponding XML model is:

    <REGRESSION-MODEL
        MODEL-ID="{ 0xf7292af1, 0x3df1, 0x11d3, { 0xb4, 0xd6, 0x0, 0x60, 0x97, 0x59, 0x4f, 0xa1 } }"
        RESPONSE-VARIABLE-NAME="jobcat"
        NUMBER-PARAMETERS="9"
        MODEL-TYPE="multinomial-logistic"
        VERBOSE-MODEL-SPECIFICATION="NOMREG jobcat BY sex minority  WITH age work /INTERCEPT = INCLUDE  /MODEL = sex minority(sex) age work">

        <FACTOR-LIST>
            <VAR-NAME>sex</VAR-NAME>
            <VAR-NAME>minority</VAR-NAME>
        </FACTOR-LIST>
     

        <COVARIATE-LIST>
            <VAR-NAME>age</VAR-NAME>
            <VAR-NAME>work</VAR-NAME>
        </COVARIATE-LIST>
     

        <PREDICTOR-TO-PARAMETER-CORRELATION-MATRIX>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="sex" PARAMETER-NAME="[SEX=0]">0</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="sex" PARAMETER-NAME="[SEX=1]">1</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="sex" PARAMETER-NAME="[MINORITY=0]([SEX=0])">0</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="sex" PARAMETER-NAME="[MINORITY=1]([SEX=0])">0</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="sex" PARAMETER-NAME="[MINORITY=0]([SEX=1])">1</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="sex" PARAMETER-NAME="[MINORITY=1]([SEX=1])">1</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="minority" PARAMETER-NAME="[MINORITY=0]([SEX=0])">0</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="minority" PARAMETER-NAME="[MINORITY=1]([SEX=0])">1</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="minority" PARAMETER-NAME="[MINORITY=0]([SEX=1])">0</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="minority" PARAMETER-NAME="[MINORITY=1]([SEX=1])">1</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="age" PARAMETER-NAME="AGE">1</PREDICTOR-TO-PARAMETER-CELL>
            <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="work" PARAMETER-NAME="WORK">1</PREDICTOR-TO-PARAMETER-CELL>
       </PREDICTOR-TO-PARAMETER-CORRELATION-MATRIX>
     

        <PARAMETER-TABLE>
            <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="Intercept" BETA="26.836" STD-ERROR="3526.252" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="[SEX=0]" BETA="-.719" STD-ERROR="3526.250" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="[MINORITY=0]([SEX=0])" BETA="-19.214" STD-ERROR="1.187" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="[MINORITY=0]([SEX=1])" BETA="-.114" STD-ERROR="2606.65" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="AGE" BETA="-.133" STD-ERROR=".086" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="WORK" BETA="7.885E-02" STD-ERROR=".104" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="2" PARAMETER-NAME="Intercept" BETA="31.077" STD-ERROR="3526.252" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="2" PARAMETER-NAME="[SEX=0]" BETA="-.869" STD-ERROR="3526.250" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="2" PARAMETER-NAME="[MINORITY=0]([SEX=0])" BETA="-18.99" STD-ERROR="1.213" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="2" PARAMETER-NAME="[MINORITY=0]([SEX=1])" BETA="1.01" STD-ERROR="2606.65" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="2" PARAMETER-NAME="AGE" BETA="-.3" STD-ERROR=".091" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="2" PARAMETER-NAME="WORK" BETA=".152" STD-ERROR=".111" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="3" PARAMETER-NAME="Intercept" BETA="6.836" STD-ERROR="4061.421" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="3" PARAMETER-NAME="[SEX=0]" BETA="16.305" STD-ERROR="4061.419" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="3" PARAMETER-NAME="[MINORITY=0]([SEX=0])" BETA="-20.041" STD-ERROR="1.297" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="3" PARAMETER-NAME="[MINORITY=0]([SEX=1])" BETA="-.73" STD-ERROR="3449.165" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="3" PARAMETER-NAME="AGE" BETA="-.156" STD-ERROR=".107" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="3" PARAMETER-NAME="WORK" BETA=".267" STD-ERROR=".124" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="4" PARAMETER-NAME="Intercept" BETA="8.816" STD-ERROR="2862.832" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="4" PARAMETER-NAME="[SEX=0]" BETA="15.264" STD-ERROR="2862.829" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="4" PARAMETER-NAME="[MINORITY=0]([SEX=0])" BETA="-16.799" STD-ERROR="1.546" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="4" PARAMETER-NAME="[MINORITY=0]([SEX=1])" BETA="16.48" STD-ERROR="0.00" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="4" PARAMETER-NAME="AGE" BETA="-.133" STD-ERROR=".091" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="4" PARAMETER-NAME="WORK" BETA="-.16" STD-ERROR=".126" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="5" PARAMETER-NAME="Intercept" BETA="5.862" STD-ERROR="5011.208" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="5" PARAMETER-NAME="[SEX=0]" BETA="16.437" STD-ERROR="5011.207" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="5" PARAMETER-NAME="[MINORITY=0]([SEX=0])" BETA="-17.309" STD-ERROR="1.383" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="5" PARAMETER-NAME="[MINORITY=0]([SEX=1])" BETA="15.888" STD-ERROR="4412.753" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="5" PARAMETER-NAME="AGE" BETA="-.105" STD-ERROR=".090" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="5" PARAMETER-NAME="WORK" BETA="6.914E-02" STD-ERROR=".109" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="6" PARAMETER-NAME="Intercept" BETA="6.495" STD-ERROR="9095.723" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="6" PARAMETER-NAME="[SEX=0]" BETA="17.297" STD-ERROR="9095.722" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="6" PARAMETER-NAME="[MINORITY=0]([SEX=0])" BETA="-19.098" STD-ERROR=".000" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="6" PARAMETER-NAME="[MINORITY=0]([SEX=1])" BETA="16.841" STD-ERROR="8780.225" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="6" PARAMETER-NAME="AGE" BETA="-.141" STD-ERROR=".119" DF="1"/>
            <PARAMETER-CELL TARGET-CATEGORY="6" PARAMETER-NAME="WORK" BETA="-5.058E-02" STD-ERROR=".184" DF="1"/>
        </PARAMETER-TABLE>

    </REGRESSION-MODEL>
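
A file shaped like the sample above could be read with Python's standard xml.etree.ElementTree module, roughly as follows. This is only a sketch under stated assumptions: the function name parse_model, the returned structures, the abbreviated SAMPLE fragment, and its placeholder MODEL-ID are all hypothetical, and the element and attribute names use the upper-case spelling of the sample rather than the lower-case spelling of the DTD.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative fragment in the same shape as the sample model.
SAMPLE = """
<REGRESSION-MODEL MODEL-ID="placeholder-id" RESPONSE-VARIABLE-NAME="jobcat"
    NUMBER-PARAMETERS="9" MODEL-TYPE="multinomial-logistic">
  <FACTOR-LIST><VAR-NAME>sex</VAR-NAME><VAR-NAME>minority</VAR-NAME></FACTOR-LIST>
  <COVARIATE-LIST><VAR-NAME>age</VAR-NAME><VAR-NAME>work</VAR-NAME></COVARIATE-LIST>
  <PREDICTOR-TO-PARAMETER-CORRELATION-MATRIX>
    <PREDICTOR-TO-PARAMETER-CELL PREDICTOR-NAME="age" PARAMETER-NAME="AGE">1</PREDICTOR-TO-PARAMETER-CELL>
  </PREDICTOR-TO-PARAMETER-CORRELATION-MATRIX>
  <PARAMETER-TABLE>
    <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="Intercept" BETA="26.836" STD-ERROR="3526.252" DF="1"/>
    <PARAMETER-CELL TARGET-CATEGORY="1" PARAMETER-NAME="AGE" BETA="-.133" STD-ERROR=".086" DF="1"/>
  </PARAMETER-TABLE>
</REGRESSION-MODEL>
"""


def parse_model(xml_text):
    """Return (factors, covariates, matrix, betas) read from a model document."""
    root = ET.fromstring(xml_text.strip())
    # findall returns an empty list when the optional list element is absent.
    factors = [v.text for v in root.findall("FACTOR-LIST/VAR-NAME")]
    covariates = [v.text for v in root.findall("COVARIATE-LIST/VAR-NAME")]
    # Sparse matrix: (parameter name, predictor name) -> cell content.
    matrix = {
        (c.get("PARAMETER-NAME"), c.get("PREDICTOR-NAME")): c.text
        for c in root.iter("PREDICTOR-TO-PARAMETER-CELL")
    }
    # Parameter table: (target category, parameter name) -> beta.
    betas = {
        (c.get("TARGET-CATEGORY"), c.get("PARAMETER-NAME")): float(c.get("BETA"))
        for c in root.iter("PARAMETER-CELL")
    }
    return factors, covariates, matrix, betas


factors, covariates, matrix, betas = parse_model(SAMPLE)
print(factors, covariates)
print(matrix)
print(betas)
```

ElementTree does not validate against a DTD, so this sketch accepts any well-formed document and leaves structural checks to the caller.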

    3. Scoring algorithm.

    We will use the above example to illustrate the steps of the scoring process. Suppose the following case (observation) must be scored:

        obs  = (sex=1 minority=0 age=25 work=4)

    1. Parse the model file. Reconstruct the dictionary, the predictor-to-parameter correlation matrix, and the parameter table.
    2. For ease of explanation, we assume the predictor-to-parameter correlation matrix is laid out as in the example. Construct a vector x whose length equals the number of parameters, as follows. Compare the vector obs with each row of the matrix. If there is some factor f such that row(f) is not missing and row(f) != obs(f), set the corresponding value of x to 0. Otherwise, set the value to Π obs(cv)^row(cv), where Π stands for the product over all covariates cv for which row(cv) is not missing (an empty product equals 1, as in the Intercept row). In our case, the vector x is x = (1, 0, 1, 0, 0, 1, 0, 25, 4).
    3. For each value j of the target variable (except the last), construct the vector βj containing the parameter estimates for that value and compute the number Sj = exp(<x, βj>), where <x, βj> is the inner product of x and βj. In the example, parameters that have no cell in the parameter table (such as [SEX=1]) are redundant and are taken to have an estimate of 0.
    4. For each value j of the target variable, compute the probability that the response variable takes the value j: P(jobcat=j) = Sj / (S1+S2+...+S6+1), with Sj = 1 for the last category (j = 7). The case can then be assigned to the category that yields the largest probability.
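
The steps above can be sketched in Python using the numbers from the example. This is a minimal illustration rather than a reference implementation: the names MATRIX, BETA_NAMES, BETAS, design_vector, and score are hypothetical, the matrix and table are written out by hand instead of being parsed, factor cells use the 0/1 coding from the variable table, and parameters that have no cell in the parameter table are assumed to have an estimate of 0.

```python
import math

FACTORS = ["sex", "minority"]
COVARIATES = ["age", "work"]

# Rows of the predictor-to-parameter matrix from the example.
# Only non-empty cells are listed; a predictor absent from a row is an empty cell.
MATRIX = {
    "Intercept": {},
    "[SEX=0]": {"sex": 0},
    "[SEX=1]": {"sex": 1},
    "[MINORITY=0]([SEX=0])": {"sex": 0, "minority": 0},
    "[MINORITY=1]([SEX=0])": {"sex": 0, "minority": 1},
    "[MINORITY=0]([SEX=1])": {"sex": 1, "minority": 0},
    "[MINORITY=1]([SEX=1])": {"sex": 1, "minority": 1},
    "AGE": {"age": 1},
    "WORK": {"work": 1},
}

# Parameter estimates per target category, listed in the order of BETA_NAMES.
BETA_NAMES = ["Intercept", "[SEX=0]", "[MINORITY=0]([SEX=0])",
              "[MINORITY=0]([SEX=1])", "AGE", "WORK"]
BETAS = {
    1: (26.836, -0.719, -19.214, -0.114, -0.133, 7.885e-02),
    2: (31.077, -0.869, -18.99, 1.01, -0.3, 0.152),
    3: (6.836, 16.305, -20.041, -0.73, -0.156, 0.267),
    4: (8.816, 15.264, -16.799, 16.48, -0.133, -0.16),
    5: (5.862, 16.437, -17.309, 15.888, -0.105, 6.914e-02),
    6: (6.495, 17.297, -19.098, 16.841, -0.141, -5.058e-02),
}


def design_vector(obs):
    """Step 2: build one entry of x per matrix row."""
    x = []
    for row in MATRIX.values():
        if any(f in row and row[f] != obs[f] for f in FACTORS):
            x.append(0.0)  # a factor cell disagrees with the observation
        else:
            value = 1.0    # empty product, e.g. the Intercept row
            for cv in COVARIATES:
                if cv in row:
                    value *= obs[cv] ** row[cv]
            x.append(value)
    return x


def score(obs):
    """Steps 3 and 4: return {category: probability} for all 7 categories."""
    x = dict(zip(MATRIX, design_vector(obs)))
    s = {}
    for j, estimates in BETAS.items():
        beta_j = dict(zip(BETA_NAMES, estimates))
        # Assumption: parameters missing from the table have an estimate of 0.
        s[j] = math.exp(sum(x[p] * beta_j.get(p, 0.0) for p in MATRIX))
    s[7] = 1.0  # last category, for which no parameters are exported
    total = sum(s.values())
    return {j: sj / total for j, sj in s.items()}


obs = {"sex": 1, "minority": 0, "age": 25, "work": 4}
print(design_vector(obs))  # [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 25.0, 4.0]
probabilities = score(obs)
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities[predicted])
```

For models with very large linear predictors, exp may overflow; a production scorer would typically subtract the largest inner product before exponentiating, which this sketch omits for clarity.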