SmartDrill

Using multiple modeling techniques on the same data set

Logistic Regression Model

Next, we model the same data set using logistic (instead of linear) regression. Logistic regression is designed specifically for situations in which we have a dichotomous dependent variable, whereas linear regression is typically used where we have a continuous dependent variable (e.g., dollar value). However, as we shall see, the linear model holds up well relative to the logistic regression model.
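As a minimal sketch of why logistic regression suits a dichotomous outcome: the model is linear in the log-odds, and the logistic function converts those log-odds to a probability that always lies strictly between 0 and 1. The coefficients and the single predictor `x` below are hypothetical, chosen only for illustration; they are not the fitted credit risk model.

```python
import math

def logistic(log_odds):
    """Convert log-odds to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def logit(probability):
    """Inverse of logistic(): convert a probability back to log-odds."""
    return math.log(probability / (1.0 - probability))

# Hypothetical coefficients, for illustration only (not the fitted model).
INTERCEPT = -1.0
SLOPE = 0.5

def predicted_probability(x):
    """Predicted probability of the 'good risk' outcome for one predictor x."""
    return logistic(INTERCEPT + SLOPE * x)

# The prediction stays inside (0, 1) however extreme the predictor value is.
for x in (-4.0, 0.0, 2.0, 8.0):
    print(x, round(predicted_probability(x), 3))
```

With these assumed coefficients, log-odds of zero (at x = 2) correspond to a probability of exactly .50.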

The following table shows the level of correct classification from the logistic regression model:

 Credit Risk Logistic Regression Model
- Classification Table -

                      Predicted
Observed       Bad risk   Good risk   Percent correct
Bad risk          3,663       1,182            75.60%
Good risk         1,952      25,548            92.90%
Overall:                                       90.31%

This table shows a correct classification rate for bad credit risks of just over 75%, and a correct classification rate for good risks of nearly 93%. While the previously discussed linear regression model did a somewhat better job of identifying good credit risks (95.5% vs. 92.9%), the logistic regression model is slightly better at identifying the bad credit risks (75.6% vs. 75.0%). Overall, both models perform quite similarly: 90.77% correct for the linear model, 90.31% for the logistic model.
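The percentages in the classification table follow directly from its four cell counts. The short sketch below reproduces that arithmetic; the counts come from the table, while the dictionary layout and function names are our own illustrative choices.

```python
# Cell counts from the classification table; keys are (observed, predicted).
counts = {
    ("bad", "bad"): 3663,
    ("bad", "good"): 1182,
    ("good", "bad"): 1952,
    ("good", "good"): 25548,
}

def percent_correct(counts, group):
    """Percent of observed `group` cases that were also predicted as `group`."""
    total = sum(n for (observed, _), n in counts.items() if observed == group)
    return 100.0 * counts[(group, group)] / total

def overall_percent_correct(counts):
    """Percent of all cases whose predicted group matches the observed group."""
    correct = sum(n for (observed, predicted), n in counts.items()
                  if observed == predicted)
    return 100.0 * correct / sum(counts.values())

bad_rate = percent_correct(counts, "bad")
good_rate = percent_correct(counts, "good")
overall_rate = overall_percent_correct(counts)
print(f"{bad_rate:.2f} {good_rate:.2f} {overall_rate:.2f}")  # 75.60 92.90 90.31
```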

The following chart shows the cumulative probability distribution of predicted scores from the logistic regression model:

[Chart: Logistic Regression Probability Plot]

We can see that at the point where the cumulative distribution of predicted scores approaches 50%, there is a sudden jump in the distribution: there is a wide separation between the predicted bad credit risk scores and the predicted good risk scores. Thus, we see a much "cleaner" separation between predicted bad risk households and predicted good risk households than we did with linear regression.

This separation is apparent in the following annotated output from the computer program that ran the logistic regression:

 Credit Risk Logistic Regression Model
- Observed Groups and Predicted Probabilities -

   16000 +                                                            +
         I                                                            I
         I                                                            I
F        I                                                           GI
R  12000 +                                                           G+
E        I                                                           GI
Q        I                                                           GI
U        I                                                           GI
E   8000 +                                                           G+
N        I                                                           GI
C        I                                                          GGI
Y        I                                                          GGI
    4000 +                                                          GG+
         I                    G                               G     GGI
         I                B   B                               G G   GGI
         I                B   B                           G   G G  GGGI
-Predicted--------------+--------------+--------------+---------------
 Prob.:   0            .25            .5             .75             1
-Group:   BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
  • Predicted Probability is of Membership for Good Credit Risk group.
  • The Cut Value is probability = .50
  • Symbols: B - Bad credit risk, G - Good credit risk
  • Each Symbol Represents 1,000 Cases. Note: no symbols are shown for probabilities associated with fewer than 1,000 cases.
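A plot of this kind can be approximated by binning the predicted probabilities and printing one symbol per fixed number of cases in each bin. The sketch below is one plausible reading of the notes above, not the statistical program's actual algorithm; the bin count, the function names, and the tiny made-up case list are all assumptions for illustration.

```python
from collections import Counter

def bin_index(probability, n_bins):
    """Map a probability in [0, 1] to a bin number in 0..n_bins-1."""
    return min(int(probability * n_bins), n_bins - 1)

def symbols_per_bin(cases, n_bins=4, cases_per_symbol=1000):
    """Count cases per (bin, observed group) and convert counts to whole symbols.

    `cases` is an iterable of (predicted_probability, observed_group) pairs,
    where observed_group is "B" (bad risk) or "G" (good risk). Cells holding
    fewer than `cases_per_symbol` cases get no symbol at all.
    """
    counts = Counter((bin_index(p, n_bins), group) for p, group in cases)
    symbols = {}
    for (bin_number, group), n in counts.items():
        whole_symbols = n // cases_per_symbol
        if whole_symbols > 0:
            symbols[(bin_number, group)] = group * whole_symbols
    return symbols

# Made-up toy cases (not the credit data), with 2 cases per symbol.
toy_cases = ([(0.10, "B")] * 5 + [(0.30, "B")] * 2
             + [(0.30, "G")] * 1 + [(0.95, "G")] * 7)
toy_symbols = symbols_per_bin(toy_cases, n_bins=4, cases_per_symbol=2)
print(toy_symbols)
```

In the toy run, the single good risk case in the second bin falls below the two-case threshold and so produces no symbol, mirroring the note about probabilities associated with fewer than 1,000 cases.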

Therefore, when the time comes to score a prospect file of credit applicants, it would probably be preferable to score the file using the logistic regression model rather than the linear regression model. The logistic model gives a cleaner separation at the 50% point of the cumulative probability distribution, while the linear model's lowest predicted probabilities are too high, and its highest predicted probabilities actually exceed 1.00, which is impossible for a true probability.
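To illustrate the scoring point, the sketch below scores a few hypothetical prospects with both model forms and classifies them at the .50 cut value. All coefficients and predictor values are invented for illustration and are not the fitted models; the only purpose is to show that a linear prediction can drift above 1.00 while a logistic prediction cannot.

```python
import math

CUT_VALUE = 0.50  # the cut value stated in the program output

def linear_prediction(x, intercept, slope):
    """Linear model: the predicted 'probability' is unbounded."""
    return intercept + slope * x

def logistic_prediction(x, intercept, slope):
    """Logistic model: the predicted probability lies strictly in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def classify(probability, cut_value=CUT_VALUE):
    """Assign a prospect to a group by comparing its score to the cut value."""
    return "good risk" if probability >= cut_value else "bad risk"

# Hypothetical predictor values for four prospects (illustration only).
prospects = [-6.0, -3.0, 0.0, 2.0]

# Hypothetical coefficients for each model form (illustration only).
linear_scores = [linear_prediction(x, 0.85, 0.10) for x in prospects]
logistic_scores = [logistic_prediction(x, 2.0, 1.0) for x in prospects]
logistic_groups = [classify(p) for p in logistic_scores]

for x, lin, log, group in zip(prospects, linear_scores,
                              logistic_scores, logistic_groups):
    print(f"x={x:5.1f}  linear={lin:.2f}  logistic={log:.3f}  {group}")
```

With these assumed values, the highest linear score is above 1.00 and the lowest is well above zero, whereas the logistic scores span most of the 0 to 1 range without leaving it.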


Copyright © 1998-2009 SmartDrill. All rights reserved.