SmartDrill

Creative Data Mining


The value of using multiple modeling techniques on the same data

For this example, we will use two different techniques, Logistic Regression and CHAID, to build targeting models for a direct mail promotion effort, using data from a prior direct mail outreach program. The dependent variable (the variable we wish to predict) has two categories: mail responder vs. non-responder. The predictor variables may include demographics, lifestyle, relevant behaviors, etc., from proprietary databases and/or syndicated data sources.

Logistic regression is a type of regression that is appropriate for generating models to predict a dichotomous dependent variable (e.g., responder vs. non-responder), as opposed to a continuous dependent variable (e.g., dollar spending), for which other techniques such as linear regression would be employed. Logistic regression is especially useful for scoring a prospect file because the coefficients generated by this technique can be used to assign to each case (e.g., household) a predicted probability of being a member of the desired category of the dependent variable (e.g., responders). These probabilities range from zero to 100%, providing us with a directly useful measure of likelihood of being in the desired category of the dependent variable (unlike linear regression, for example, whose predicted values can fall below zero or above 100% and so may not correspond to actual probabilities).
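As a minimal sketch of how logistic regression coefficients become a probability score, the following applies the standard logistic function to a weighted sum of predictors. The intercept, the coefficients, and the predictor names (prior_purchases, owns_home) are hypothetical values chosen for illustration; they do not come from the model discussed here.

```python
import math

def predicted_probability(intercept, coefficients, predictors):
    """Convert a linear score into a probability between 0 and 1."""
    linear_score = intercept + sum(
        coefficients[name] * value for name, value in predictors.items()
    )
    # The logistic function maps any real-valued score into (0, 1).
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical coefficients, for illustration only.
intercept = -3.0
coefficients = {"prior_purchases": 0.8, "owns_home": 0.5}

# One hypothetical household to score.
household = {"prior_purchases": 2, "owns_home": 1}
probability = predicted_probability(intercept, coefficients, household)
print(f"Predicted response probability: {probability:.1%}")
```

However large or small the weighted sum becomes, the resulting score stays between zero and 100%, which is what makes it usable as a likelihood of response.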

CHAID, or Chi-square Automatic Interaction Detection, is a Classification Tree technique that evaluates complex interactions among predictors and displays the modeling results in an easy-to-interpret tree diagram. In addition to allowing us to score a prospect file, the tree gives us a clear visual picture of the market structure. The "root" or "trunk" of the tree represents the total modeling database. CHAID then creates a first layer of "branches" by displaying values of the strongest predictor of the dependent variable. CHAID automatically determines whether, and how, to group the individual values of this predictor into a statistically meaningful number of broader categories. (E.g., we may start with ten categories of age, and CHAID might collapse these ten categories down to only four or five age groupings that differ significantly from one another.)
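The category-grouping step can be illustrated with a simplified sketch: compare two adjacent categories of a predictor using a Pearson chi-square statistic, and treat them as candidates for merging when the difference in response is not statistically significant. The age bands and counts below are hypothetical, and the fixed 0.05 cutoff is an assumption; actual CHAID software repeats such comparisons iteratively and adjusts the significance levels, which this sketch does not attempt.

```python
def chi_square_2x2(group_a, group_b):
    """Pearson chi-square statistic for two (responders, non-responders) pairs."""
    observed = [list(group_a), list(group_b)]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Chi-square critical value for 1 degree of freedom at the 0.05 level.
CRITICAL_VALUE = 3.841

# Hypothetical (responders, non-responders) counts for three age bands.
age_25_34 = (30, 970)
age_35_44 = (33, 967)
age_45_54 = (80, 920)

stat_young = chi_square_2x2(age_25_34, age_35_44)
stat_older = chi_square_2x2(age_35_44, age_45_54)

# Similar response rates: the first two bands are candidates for merging.
print("25-34 vs 35-44:", "merge" if stat_young < CRITICAL_VALUE else "keep separate")
# Clearly different response rates: the last two bands stay separate.
print("35-44 vs 45-54:", "merge" if stat_older < CRITICAL_VALUE else "keep separate")
```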

CHAID then creates additional layers of branches off of each age grouping, using the strongest of the remaining predictors. It continues this branching procedure until the final branches or "twigs" of the tree have been generated. If CHAID is being used to generate a predictive market segmentation model, then these terminal branches are the final market segments, and each market segment has a model score associated with it.

A typical CHAID model may have anywhere from a dozen or so terminal segments to as many as 50 or 60, or occasionally even more. The segments are depicted in the tree diagram as well as ranked in a "gains chart." Statistics produced by the gains chart make it easy to determine how "deep" into a file one must go to select prospects representing a given level of above-average performance (e.g., dollar value, response rate, etc.). Financial data or assumptions can also be incorporated into the predictive CHAID model results, to generate various strategic or tactical planning estimates such as Return on Investment, mail cost savings, etc.

When the dependent variable we are trying to predict has only two values (e.g., mail responder vs. non-responder), we generate what is called a "nominal" CHAID model. In such a model, we are able to see what proportion of each market segment consists of cases in the desired category of the dependent variable (e.g., mail responders). A relative performance index can also be generated for each segment, based on the proportion of that segment that falls into the desired category of the dependent variable.

The following gains chart is a cumulative results table showing the CHAID modeling results for our simple example, which has only five segments, ranked in descending order of percent response to a mailing:



Segment #    Cumulative % of mailed HH    Cumulative % of responder HH
    1                   8.2                          47.6
    2                  14.8                          66.1
    3                  24.8                          85.1
    4                  40.7                          99.4
    5                 100.0                         100.0

The above chart shows us that the best three segments account for 24.8% of the modeling sample, and 85.1% of all responder households. And the top 8.2% of all households (segment #1) account for nearly half (47.6%) of all responder households. Thus, this model does a good job of discriminating between responder and non-responder households.
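To see how strongly each individual segment performs, the cumulative figures in the gains chart can be converted into per-segment shares and a relative performance index (100 = average response). The sketch below uses only the five rows from the chart; the function name and the rounding choices are illustrative assumptions.

```python
# (segment, cumulative % of mailed HH, cumulative % of responder HH),
# copied from the gains chart in this example.
gains_chart = [
    (1, 8.2, 47.6),
    (2, 14.8, 66.1),
    (3, 24.8, 85.1),
    (4, 40.7, 99.4),
    (5, 100.0, 100.0),
]

def segment_indexes(chart):
    """Return (segment, % of mailed HH, % of responder HH, index) per segment."""
    rows = []
    prev_mailed = prev_responders = 0.0
    for segment, cum_mailed, cum_responders in chart:
        share_mailed = cum_mailed - prev_mailed
        share_responders = cum_responders - prev_responders
        # Index of 100 means the segment responds at the overall average rate.
        index = round(100.0 * share_responders / share_mailed)
        rows.append((segment, round(share_mailed, 1), round(share_responders, 1), index))
        prev_mailed, prev_responders = cum_mailed, cum_responders
    return rows

for segment, mailed, responders, index in segment_indexes(gains_chart):
    print(f"Segment {segment}: {mailed}% of HH, {responders}% of responders, index {index}")
```

Segment #1, for instance, holds 8.2% of households but 47.6% of responders, for an index of about 580, while segment #5 holds 59.3% of households and only 0.6% of responders.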

The CHAID analysis in our example cuts the modeling sample into fewer gradations (only five segments) than the regression model, which assigns a probability score to each individual household. Note, however, that CHAID models typically have many more segments than the five in our simple example (often several dozen), so CHAID can actually allow us to take rather fine cuts at a file.

Therefore, we cannot examine a perfect one-to-one correspondence between the CHAID model and the regression model. However, because the logistic regression allows finer cuts at the data than CHAID does, we can select cut points that correspond to break points between CHAID segments, and then see what proportion of responder households the logistic regression model captures at these cut points.
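A minimal sketch of this cut-point comparison: rank the households by their logistic regression score, take the top portion of the file down to a chosen depth, and count the share of all responder households that falls inside it. The ten scores and response flags below are hypothetical, as is the function name.

```python
def capture_at_depth(scores, responded, depth_pct):
    """Percent of all responders found in the top depth_pct percent of the file."""
    # Rank households from highest to lowest predicted probability.
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    n_selected = round(len(ranked) * depth_pct / 100.0)
    captured = sum(flag for _, flag in ranked[:n_selected])
    return 100.0 * captured / sum(responded)

# Hypothetical scored file: model score and actual response (1 = responder).
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

for depth in (20, 40, 100):
    pct = capture_at_depth(scores, responded, depth)
    print(f"Top {depth}% of file captures {pct:.1f}% of responders")
```

Setting the depth to the cumulative percentage at a CHAID segment boundary gives a figure that can be compared directly with the CHAID gains chart.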

The following table shows cut points for approximately the top 8% of the sample (segment #1), and approximately the top 24% of the sample (segments 1-3):


Modeling technique:                                                 Logistic Regression         CHAID

% of responders captured @ approx. 24% of total file households     85.1% @ 22.1% of all HH     85.1% @ 24.8% of all HH
% of responders captured @ approx. 8% of total file households      42.3% @ 8.0% of all HH      47.6% @ 8.2% of all HH

This table shows us that the logistic regression model performs slightly better at greater depth into the file, since it captures the same percent of responder households (85.1%) as the CHAID model, while going only 22.1% down into the file (vs. 24.8% for CHAID). However, CHAID outperforms logistic regression when skimming the cream off the top of the file: when we go down into only 8.2% of the file, CHAID captures 47.6% of the responder households (vs. a 42.3% capture rate at 8.0% of the file for logistic regression).

This suggests that CHAID may be the more useful scoring model if we want to identify the very best prospects for a direct mail effort. But it may be useful to score the remainder of the file using the logistic regression model, if we wish to identify good prospects beyond about the top 8%. Or, as an alternative to double-scoring the file, we can first generate a CHAID model, which is useful for capturing complex interactions among the predictors in their relationship to the dependent variable. We can then follow this up with a logistic regression model, in which we include the individual segments from the CHAID model as dummy-variable predictors, along with the individual predictor variables used to generate the original CHAID model.

Using this sequential modeling approach we can, for example, create dummy variables to represent the results of the CHAID segmentation. If the CHAID model produced, say, 50 segments, then we would create 49 dummy predictor variables to represent the CHAID segments. If any given household falls into segment #1, then we would give that household a code of "1" on the dummy variable representing segment #1. Households not falling into segment #1 would get a "0" code on this dummy variable.

We would perform a similar coding procedure for each of the dummy variables representing segments two through 49. We do not create a dummy variable to represent segment 50 because we already know that if a household has a code of "0" for each of the 49 dummy variables, the household must therefore be in segment 50. (For those familiar with statistical theory, this appropriately reflects the fact that there are only 49 degrees of freedom in a 50-segment CHAID model.)
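The coding scheme described above can be sketched in a few lines: each household's CHAID segment number becomes a list of 49 zeros and ones, with segment 50 serving as the reference category. The function name and the way segments are numbered are illustrative assumptions.

```python
def dummy_code(segment, n_segments=50):
    """Return n_segments - 1 dummy values; the last segment is the reference."""
    if not 1 <= segment <= n_segments:
        raise ValueError(f"segment must be between 1 and {n_segments}")
    # Dummy k is 1 only when the household falls into segment k.
    return [1 if segment == k else 0 for k in range(1, n_segments)]

in_segment_1 = dummy_code(1)    # 1 on the first dummy, 0 on the other 48
in_segment_50 = dummy_code(50)  # 0 on all 49 dummies

print(len(in_segment_1), sum(in_segment_1), in_segment_1[0])
print(len(in_segment_50), sum(in_segment_50))
```

These dummy values would then be supplied to the logistic regression as predictors, alongside the original predictor variables.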

By first performing a CHAID analysis, and then feeding the results into a subsequent logistic regression model, we can typically improve the targeting effectiveness of the overall modeling effort by at least 10% to 20% vs. using just logistic regression or CHAID alone. And because most of the time and effort in a modeling project is typically consumed by the data acquisition and data preparation phases, the incremental cost of using more than one modeling technique at the back-end of a project is relatively small.

Copyright © 1998-2008 SmartDrill. All rights reserved.