Creative Data Mining
The value of using multiple modeling techniques on the same data
For this example, we will use two different techniques, logistic regression
and CHAID, to build targeting models for a direct mail promotion effort
using data from a prior direct mail outreach program. The dependent variable
(the variable we wish to predict) has two categories, mail responder vs.
non-responder. The predictor variables may include demographics, lifestyle,
relevant behaviors, etc., from proprietary databases and/or syndicated
data sources.
Logistic regression is a type of regression that is appropriate for
generating models to predict a dichotomous dependent variable (e.g., responder
vs. non-responder), as opposed to a continuous dependent variable (e.g.,
dollar spending), for which other techniques such as linear regression
would be employed. Logistic regression is especially useful for scoring
a prospect file because the coefficients generated by this technique can be
used to assign to each case (e.g., household) a predicted probability of
being a member of the desired category of the dependent variable (e.g.,
responders). These probabilities range from zero to 100%, providing us with
a directly useful measure of likelihood of being in the desired category of
the dependent variable (unlike linear regression, for example, whose
predicted values can fall below zero or above 100% and so may not
correspond to actual probabilities).
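To make the scoring step concrete, here is a minimal sketch of how fitted logistic regression coefficients turn a household's predictor values into a probability. The intercept, coefficients, and predictor names are hypothetical values chosen for illustration; they do not come from the model discussed in this article.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Logistic score: 1 / (1 + exp(-(b0 + sum of b_i * x_i)))."""
    linear_score = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical fitted values (assumptions, not from the article):
# predictor 1 = prior purchaser flag (0/1), predictor 2 = age in decades.
INTERCEPT = -3.0
COEFFICIENTS = [1.2, 0.15]

household = [1, 4.5]  # a prior purchaser, aged 45
p = predicted_probability(INTERCEPT, COEFFICIENTS, household)
print(f"Predicted response probability: {p:.1%}")
```

Whatever the coefficient values, the logistic function keeps the result strictly between 0 and 1, which is why the score can be read directly as a probability.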
CHAID, or Chi-square Automatic Interaction Detection, is a Classification
Tree technique that not only evaluates complex interactions among predictors,
but also displays the modeling results in an easy-to-interpret tree diagram
which, in addition to allowing us to score a prospect file, also gives
us a clear visual picture of the market structure. The "root"
or "trunk" of the tree represents the total modeling database.
CHAID then creates a first layer of "branches" by displaying
values of the strongest predictor of the dependent variable. CHAID automatically
determines whether, and how, to group the individual values of this predictor
into a statistically meaningful number of broader categories. (E.g., we
may start with ten categories of age, and CHAID might collapse these ten
categories down to only four or five age groupings that differ
significantly from one another.)
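The category-collapsing step can be sketched as follows. This is a simplified illustration, not the full CHAID procedure: it merges only adjacent categories of an ordered predictor, uses a fixed 5% chi-square critical value rather than adjusted p-values, and never re-splits a merged group. The age bands and counts are hypothetical.

```python
def chi_square_2x2(a, b):
    """Pearson chi-square for two categories, each (responders, non_responders)."""
    rows = (a, b)
    total = sum(a) + sum(b)
    col_totals = (a[0] + b[0], a[1] + b[1])
    stat = 0.0
    for row in rows:
        for j, observed in enumerate(row):
            expected = sum(row) * col_totals[j] / total
            if expected > 0:  # skip empty columns to avoid dividing by zero
                stat += (observed - expected) ** 2 / expected
    return stat

CRITICAL_05_DF1 = 3.841  # chi-square critical value, 1 df, alpha = 0.05

def merge_adjacent(categories, critical=CRITICAL_05_DF1):
    """Merge the least-different adjacent pair until all pairs differ significantly."""
    groups = [([label], counts) for label, counts in categories]
    while len(groups) > 1:
        stats = [chi_square_2x2(groups[i][1], groups[i + 1][1])
                 for i in range(len(groups) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= critical:
            break  # every remaining adjacent pair is significantly different
        labels = groups[i][0] + groups[i + 1][0]
        counts = (groups[i][1][0] + groups[i + 1][1][0],
                  groups[i][1][1] + groups[i + 1][1][1])
        groups[i:i + 2] = [(labels, counts)]
    return groups

# Hypothetical age bands: (label, (responders, non_responders))
AGE_BANDS = [("18-24", (10, 990)), ("25-34", (12, 988)), ("35-44", (30, 970)),
             ("45-54", (33, 967)), ("55+", (80, 920))]

merged = merge_adjacent(AGE_BANDS)
for labels, (resp, non_resp) in merged:
    print(labels, f"response rate {resp / (resp + non_resp):.1%}")
```

With these made-up counts, bands whose response rates are statistically indistinguishable end up grouped together, leaving only groupings that differ from their neighbors.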
CHAID then creates additional layers of branches off of each age grouping,
using the strongest of the remaining predictors. It continues this branching
procedure until the final branches or "twigs" of the tree have
been generated. If CHAID is being used to generate a predictive market
segmentation model, then these terminal branches are the final market
segments, and each market segment has a model score associated with it.
A typical CHAID model may have anywhere from a dozen or so terminal
segments to as many as 50 or 60, or occasionally even more. The segments
are depicted in the tree diagram as well as ranked in a "gains chart."
Statistics produced by the gains chart make it easy to determine how "deep"
into a file one must go to select prospects representing a given level
of above-average performance (e.g., dollar value, response rate, etc.).
Financial data or assumptions can also be incorporated into the predictive
CHAID model results, to generate various strategic or tactical planning
estimates such as Return on Investment, mail cost savings, etc.
When the dependent variable we are trying to predict has only two
values (e.g., mail responder vs. non-responder), we generate what is called
a "nominal" CHAID model. In such a model, we are able to see
what proportion of each market segment consists of cases in the desired
category of the dependent variable (e.g., mail responders). A relative
performance index can also be generated for each segment, based on the
proportion of that segment that falls into the desired category of the
dependent variable.
The following gains chart is a cumulative results table showing the
CHAID modeling results for our simple example, which has only five segments,
ranked in descending order of percent response to a mailing:
Segment # | Cumulative % of mailed HH | Cumulative % of responder HH
1 | 8.2 | 47.6
2 | 14.8 | 66.1
3 | 24.8 | 85.1
4 | 40.7 | 99.4
5 | 100.0 | 100.0
The above chart shows us that the best three segments account for
24.8% of the modeling sample, and 85.1% of all responder households. And
the top 8.2% of all households (segment #1) account for nearly half (47.6%)
of all responder households. Thus, this model does a good job of discriminating
between responder and non-responder households.
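The cumulative figures in the gains chart can be unpacked into per-segment shares and a relative performance index. The sketch below assumes the common convention that the index is a segment's share of responders divided by its share of households, times 100, so that 100 means average performance; the article does not spell out its exact formula.

```python
# Copied from the gains chart above:
# (segment, cumulative % of mailed HH, cumulative % of responder HH)
GAINS = [(1, 8.2, 47.6), (2, 14.8, 66.1), (3, 24.8, 85.1),
         (4, 40.7, 99.4), (5, 100.0, 100.0)]

def per_segment(gains):
    """Convert cumulative percentages into per-segment shares and an index."""
    rows = []
    prev_hh = prev_resp = 0.0
    for segment, cum_hh, cum_resp in gains:
        hh_share = cum_hh - prev_hh        # % of all households in this segment
        resp_share = cum_resp - prev_resp  # % of all responders in this segment
        index = 100.0 * resp_share / hh_share  # 100 = file-average response
        rows.append((segment, round(hh_share, 1), round(resp_share, 1), round(index)))
        prev_hh, prev_resp = cum_hh, cum_resp
    return rows

segment_rows = per_segment(GAINS)
for segment, hh_share, resp_share, index in segment_rows:
    print(f"Segment {segment}: {hh_share}% of HH, {resp_share}% of responders, index {index}")
```

Under that convention, segment 1 responds at roughly 5.8 times the file average, while segment 5, which holds well over half of all households, contributes almost no responders.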
The CHAID analysis in our example cuts the modeling sample into fewer
gradations (only five segments) than the regression model, which assigns a
probability score to each individual household. (Note, however, that CHAID
models typically have many more segments, usually anywhere from 20 to 80,
than we created in our simple example. Thus, CHAID can actually allow us to
take rather fine cuts at a file.)
Therefore, we cannot examine a perfect one-to-one correspondence between
the CHAID model and the regression model. However, we can select cut points
that correspond to break points between CHAID segments, and then see what
proportion of responder households is captured by the logistic regression
model at these cut points, since the logistic regression allows finer cuts
at the data than CHAID does.
The following table shows cut points for approximately the top 8%
of the sample (segment #1) and approximately the top 24% of the sample
(segments 1-3):
Modeling technique | Logistic Regression | CHAID
% of responders captured @ approx. 24% of total file households | 85.1% @ 22.1% of all HH | 85.1% @ 24.8% of all HH
% of responders captured @ approx. 8% of total file households | 42.3% @ 8.0% of all HH | 47.6% @ 8.2% of all HH
This table shows us that the logistic regression model performs slightly
better at greater depth into the file, since it captures the same percent
of responder households (85.1%) as the CHAID model, while going only 22.1%
down into the file (vs. 24.8% for CHAID). However, CHAID outperforms logistic
regression when skimming the cream off the top of the file: when we go
down into only 8.2% of the file, CHAID captures 47.6% of the responder
households (vs. a 42.3% capture rate at 8.0% of the file for logistic
regression).
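The comparison can be put on a common footing with a little arithmetic on the figures in the comparison table. The sketch below uses two hypothetical helper functions: lift (percent of responders captured per percent of households mailed) and the mail volume saved when the same responders are reached at a shallower depth.

```python
# Copied from the comparison table above:
# (% of households mailed, % of responders captured)
CUT_POINTS = {
    "deep": {"logistic": (22.1, 85.1), "chaid": (24.8, 85.1)},
    "top": {"logistic": (8.0, 42.3), "chaid": (8.2, 47.6)},
}

def lift(depth_pct, captured_pct):
    """Responders captured per household mailed, relative to random selection."""
    return captured_pct / depth_pct

def mail_saving_pct(shallower_depth, deeper_depth):
    """Percent fewer pieces mailed at the shallower depth."""
    return 100.0 * (deeper_depth - shallower_depth) / deeper_depth

for cut, models in CUT_POINTS.items():
    for name, (depth, captured) in models.items():
        print(f"{cut:>4} {name:<8} lift = {lift(depth, captured):.2f}")

saving = mail_saving_pct(CUT_POINTS["deep"]["logistic"][0],
                         CUT_POINTS["deep"]["chaid"][0])
print(f"Logistic regression mails {saving:.1f}% fewer pieces for the same 85.1% capture")
```

On these figures, logistic regression has the higher lift at the deeper cut and CHAID has the higher lift at the top of the file, which matches the reading given in the text.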
This suggests that CHAID may be the more useful scoring model if we
want to identify the very best prospects for a direct mail effort. But
it may be useful to score the remainder of the file using the logistic
regression model, if we wish to identify good prospects beyond about the
top 8%. Or, as an alternative to double-scoring the file, we can first
generate a CHAID model, which is useful for capturing the complex interactions
between predictors and the dependent variable. We can then follow this
up with a logistic regression model, in which we include the individual
segments from the CHAID model as dummy-variable predictors, along with
the original individual predictor variables used to generate the original
CHAID model.
Using this sequential modeling approach we can, for example, create
dummy variables to represent the results of the CHAID segmentation. If
the CHAID model produced, say, 50 segments, then we would create 49 dummy
predictor variables to represent the CHAID segments. If any given household
falls into segment #1, then we would give that household a code of "1"
on the dummy variable representing segment #1. Households not falling
into segment #1 would get a "0" code on this dummy variable.
We would perform a similar coding procedure for each of the dummy
variables representing segments two through 49. We do not create a dummy
variable to represent segment 50 because we already know that if a household
has a code of "0" for each of the 49 dummy variables, the household
must therefore be in segment 50. (For those familiar with statistical
theory, this appropriately reflects the fact that there are only 49 degrees
of freedom in a 50-segment CHAID model.)
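The coding scheme just described can be sketched in a few lines. The function name and the choice of the last segment as the reference category follow the description above; everything else is illustrative.

```python
def segment_dummies(segment, n_segments):
    """Return n_segments - 1 indicator values for a 1-based segment number.

    The last segment is the reference category and is coded as all zeros.
    """
    if not 1 <= segment <= n_segments:
        raise ValueError(f"segment must be between 1 and {n_segments}")
    return [1 if segment == k else 0 for k in range(1, n_segments)]

N_SEGMENTS = 50

in_segment_1 = segment_dummies(1, N_SEGMENTS)    # 1 on the first dummy only
in_segment_50 = segment_dummies(50, N_SEGMENTS)  # all 49 dummies are 0

print(len(in_segment_1), sum(in_segment_1), sum(in_segment_50))
```

Each household's 49 dummy values would then be appended to its original predictor values before fitting the logistic regression.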
By first performing a CHAID analysis, and then feeding the results
into a subsequent logistic regression model, we can typically improve
the targeting effectiveness of the overall modeling effort by at least
10% to 20% compared with using logistic regression or CHAID alone. And
because most of the time and effort in a modeling project is typically
consumed by the data acquisition and data preparation phases, the
incremental cost of using more than one modeling technique at the back end
of a project is relatively small.