Using multiple modeling techniques on the same data set

Logistic Regression Model

Next, we model the same data set using logistic (instead of linear) regression. Logistic regression is designed specifically for situations in which we have a dichotomous dependent variable, whereas linear regression is typically used where we have a continuous dependent variable (e.g., dollar value). However, as we shall see, the linear model holds up well relative to the logistic regression model. The following table shows the level of correct classification from the logistic regression model:
Credit Risk Logistic Regression Model
| Observed | Predicted: Bad risk | Predicted: Good risk | Percent correct |
| --- | --- | --- | --- |
| Bad risk | 3,663 | 1,182 | 75.60% |
| Good risk | 1,952 | 25,548 | 92.90% |
| Overall | | | 90.31% |
This table shows a correct classification rate for bad credit risks of
just over 75%, and a correct classification rate for good risks of nearly
93%. While the previously discussed linear regression model did a somewhat
better job of identifying good credit risks (95.5% vs. 92.9%), the logistic
regression model is slightly better at identifying the bad credit risks
(75.6% vs. 75.0%). Overall, both models perform quite similarly: 90.77%
correct for the linear model, 90.31% for the logistic model.
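As a check on the arithmetic, the percentages in the logistic regression table can be recomputed from its four cell counts. The following is a minimal sketch; the function and variable names are illustrative and are not part of the original analysis.

```python
# Cell counts copied from the logistic regression classification table.
# Keys are (observed group, predicted group).
counts = {
    ("bad", "bad"): 3663,
    ("bad", "good"): 1182,
    ("good", "bad"): 1952,
    ("good", "good"): 25548,
}


def percent_correct(counts, group):
    """Percent of observed `group` cases that were also predicted as `group`."""
    total = sum(n for (observed, _), n in counts.items() if observed == group)
    return 100.0 * counts[(group, group)] / total


def overall_percent_correct(counts):
    """Percent of all cases where the predicted group matches the observed group."""
    correct = sum(n for (observed, predicted), n in counts.items()
                  if observed == predicted)
    return 100.0 * correct / sum(counts.values())


bad_rate = percent_correct(counts, "bad")
good_rate = percent_correct(counts, "good")
overall_rate = overall_percent_correct(counts)

print(f"Bad risk:  {bad_rate:.2f}%")
print(f"Good risk: {good_rate:.2f}%")
print(f"Overall:   {overall_rate:.2f}%")
```

The overall rate is weighted by group size, which is why it sits much closer to the good-risk rate: good risks make up 27,500 of the 32,345 cases.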
The following chart shows the cumulative probability distribution of predicted scores from the logistic regression model:

We can see that at the point where the cumulative distribution of predicted scores approaches 50%, there is a sudden jump in the distribution: there is a wide separation between the predicted bad credit risk scores and the predicted good risk scores. Thus, we see a much "cleaner" separation between predicted bad risk households and predicted good risk households than we did with linear regression.
This separation is apparent in the following annotated output from the program that ran the logistic regression:
16000 + +
I I
I I
F I GI
R 12000 + G+
E I GI
Q I GI
U I GI
E 8000 + G+
N I GI
C I GGI
Y I GGI
4000 + GG+
I G G GGI
I B B G G GGI
I B B G G G GGGI
-Predicted--------------+--------------+--------------+---------------
Prob.: 0 .25 .5 .75 1
-Group: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
- Predicted Probability is of Membership for Good Credit Risk group.
- The Cut Value is probability = .50
- Symbols: B - Bad credit risk, G - Good credit risk
- Each Symbol Represents 1,000 Cases. Note: no symbols are shown for probabilities associated with fewer than 1,000 cases.
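The cut value in the output notes can be read as a simple classification rule: a case is assigned to the good credit risk group when its predicted probability of membership in that group reaches .50. The sketch below illustrates that rule. The function name and the example probabilities are hypothetical, and the treatment of a probability of exactly .50 (assigned here to the good risk group) is an assumption, since the output does not say how ties are handled.

```python
def classify(prob_good, cut=0.50):
    """Assign a group from the predicted probability of being a good credit risk.

    Assumption: a probability exactly equal to the cut value counts as good risk.
    """
    return "Good risk" if prob_good >= cut else "Bad risk"


# Hypothetical predicted probabilities, for illustration only.
example_probs = [0.08, 0.49, 0.50, 0.97]
labels = [classify(p) for p in example_probs]

for p, label in zip(example_probs, labels):
    print(f"{p:.2f} -> {label}")
```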
Therefore, when the time comes to score a prospect file of credit applicants, it would probably be preferable to score the file using the logistic regression model rather than the linear regression model. The logistic model gives a cleaner separation at the 50% point of the cumulative probability distribution, while the linear model's lowest predicted probabilities are too high, and its highest predicted probabilities actually exceed 1.00, which is impossible for a true probability.
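The out-of-range scores from the linear model follow from its form: a straight line fitted to a 0/1 outcome is unbounded, whereas the logistic function maps any score into the interval between 0 and 1. The sketch below illustrates this with a small made-up data set and a single hypothetical predictor; it does not use the credit risk data, and it does not fit a logistic regression, it only shows that the logistic transform is bounded.

```python
import math

# Made-up toy data, for illustration only.
# x is a single hypothetical predictor; y is 1 for good risk, 0 for bad risk.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]

# Ordinary least squares fit of y on x (closed form for one predictor).
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x


def linear_score(xi):
    """Predicted 'probability' from the straight-line fit; not bounded."""
    return intercept + slope * xi


def logistic(z):
    """Logistic function: maps any real score z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


linear_scores = [linear_score(xi) for xi in x]
logistic_scores = [logistic(z) for z in (-10, -1, 0, 1, 10)]

print("Highest linear score:", round(max(linear_scores), 3))
print("Logistic scores:", [round(p, 5) for p in logistic_scores])
```

In this toy example the highest linear score is above 1.00, while every logistic score stays strictly between 0 and 1 no matter how extreme the input.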
Copyright © 1998-2009 SmartDrill. All rights reserved.