Creative Data Mining
Geographic up-shifting and down-shifting
Geographic up-shifting and down-shifting involve methods that allow
us to take a model that was originally created at one level of geographic
specificity, and apply that model to either a higher or lower level of
geographic specificity:
- Geographic up-shifting applies a household-level targeting model
to micro-geographic-level data
- Geographic down-shifting applies a micro-geographic-level targeting
model to household-level data
Geographic up-shifting: applying a household-level targeting model
to micro-geographic-level data
Using advanced data mining techniques, existing survey research data
can often perform double duty as a bridge to larger-scale geo-demographic
analysis. For example, many times a retailer has attitude and usage data,
as well as key demographic data, from a recently conducted market research
study (e.g., a large-scale market definition study or a large-scale attitude
/ awareness / usage tracking study). The results of a predictive data
mining analysis of these data can be meaningfully projected onto units
of micro-geography (e.g., postal zip codes, mail carrier routes, census
tracts, etc., in the United States), to assist management with retail
site selection, direct mail promotion targeting, etc.
Thus, one does not have to pay a geo-demographic syndicator a large
fee for geocoding the data, overlaying their proprietary clustering codes,
and analyzing the enhanced data. Instead, one can use much less expensive
census data (which many retailers already have in-house) in conjunction
with proprietary market research data, to achieve powerful results.
Here is a simple example. Let us say that one has conducted a market
research survey that includes items measuring customer loyalty, heaviness
of spending, or usage of a particular retail department. If the survey
also includes standard, key demographic classification questions, then
one can use advanced data mining techniques to build a predictive model.
The dependent variable could be any of the aforementioned loyalty measures,
and the demographics are the predictors in the model. After generating
a satisfactory model, the results of the model can be used to score units
of micro-geography, much the same as one would use modeling results to
score households (or businesses) in a proprietary customer or prospect
database.
The trick is to translate the demographics from the survey respondent
level to the micro-geographic level. Again, to use a simple example, let
us say that one has discovered from the survey-based model that particular
age groups are more loyal (or are heavier spenders, etc.) than other age
groups. Instead of just scoring a household-level or business-level file
using the various categories of age, one can instead weight the model
coefficients by the proportions of a micro-geographic unit's population
falling into each age group.
For example, age groups' coefficients from a regression model can
be multiplied by the proportions of a micro-geographic unit's population
falling into the respective age groups. If a particular age group has
a strong positive coefficient, and/or it represents a disproportionately
large share of the population in the micro-geographic unit, then that
unit will achieve a higher model score.
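
The weighting just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the age categories, the coefficient values, the tract names and the population shares are all hypothetical, and the helper name upshift_score is invented for this sketch.

```python
# Hypothetical coefficients for age-group categories from a survey-based
# regression model (illustrative values, not from any real study).
age_coefficients = {
    "18-24": 0.10,
    "25-34": 0.40,
    "35-44": 0.25,
    "45+": 0.00,  # treated here as the reference category
}

# Hypothetical share of each tract's population in each age group.
tract_age_shares = {
    "tract_A": {"18-24": 0.10, "25-34": 0.50, "35-44": 0.20, "45+": 0.20},
    "tract_B": {"18-24": 0.30, "25-34": 0.10, "35-44": 0.20, "45+": 0.40},
}


def upshift_score(shares, coefficients):
    """Weight each category coefficient by the unit's population share."""
    return sum(coefficients[group] * share for group, share in shares.items())


# Score every micro-geographic unit with the household-level coefficients.
tract_scores = {
    tract: upshift_score(shares, age_coefficients)
    for tract, shares in tract_age_shares.items()
}
```

In this made-up case tract_A scores higher than tract_B, because half of its population falls into the age group with the strongest coefficient.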
This scoring procedure proceeds similarly for the various categories
of the other demographic variables from the survey research-based model.
After all micro-geographic units of interest have been scored with all
model parameters, standard tabulation and mapping routines can be used
to perform retail siting, promotion targeting, and even store-level merchandise
mix planning.
This approach works best if one plans ahead by designing the market
research to include demographic items that have the same category breaks
that standard micro-geographic census variables have. And if one already
has a site license that allows usage of geo-demographic cluster codes
from one or more of the popular syndicators, then that is even better.
(In the United States, syndicators such as Consumer Infobase, for example,
have the Claritas PRIZM cluster codes available for appending to proprietary
data files at the household level, and can also furnish census tract codes,
etc., for micro-geographic analysis.) The key point is that advanced data
mining techniques can significantly improve the process of knowledge discovery
and application, whether or not one has a site license for a proprietary
geo-demographic and lifestyle targeting system. And the whole process
can be based on existing proprietary market research data.
Geographic down-shifting: applying a micro-geographic-level targeting
model to household-level data
The previous example showed how the results of household-level or
respondent-level predictive models can be geographically "up-shifted"
and applied to micro-geographic units for real estate siting, promotion
targeting, etc. The next example shows how to do the reverse: "down-shifting"
a micro-geographically-based model to apply it to household-level databases.
For example, if one has sales data at some low level of micro-geography
(e.g., in the United States this might be at the census tract level),
then one can generate a predictive sales model and apply the model to
household-level data to support a direct mail outreach program or other
targeting effort.
In this example, let us say that a manufacturer, distributor or retailer
of an upscale gourmet food package has a proprietary database containing
unit or dollar sales figures by unit of micro-geography. And suppose that
management wishes to target a newsletter or other type of promotional
mailing, or perhaps some sort of loyalty-generation program or cross-selling
effort, to households that look like good sales prospects for upscale
gourmet food products.
One can start with the micro-geographic sales data, and append available
census or syndicated demographic data to the units of micro-geography.
However, unlike household-level data, these variables will be expressions
of aggregate demographic statistics, such as percent of household heads
aged 35 to 44, or percent of households with incomes above $60,000, etc.
These aggregate statistical variables are used as predictors in a model,
and per-capita or per-household sales figures are used as the dependent
variable that the model is attempting to predict.
One could use a regression-type modeling technique, for example, to
identify demographic characteristics of units of micro-geography associated
with varying levels of per-household dollar sales. The coefficients from
this model will then be applied to the scoring of household-level prospects.
Now let us see how this would actually be accomplished.
During the modeling stage, a single coefficient is generated for each
demographic variable. For example, "percentage of household heads
aged 25-34" is a continuous (quantitative) variable, ranging in value
from zero to 100%, that receives a particular coefficient. Thus, if a
higher percentage of household heads aged 25-34 results in a higher value
on the dependent variable, then the "percentage of household heads
aged 25-34" variable gets a positive coefficient of a particular
magnitude. If, in contrast, a lower percentage of household heads aged
25-34 is associated with a higher value on the dependent variable, then
the "percentage of household heads aged 25-34" variable gets
a negative coefficient of some particular magnitude, to represent this
inverse correlation.
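
To make the modeling stage concrete, here is a minimal sketch of fitting one aggregate predictor against per-household sales by ordinary least squares. It is a deliberate simplification: a real model would include several predictors at once, and the tract figures, variable names and the helper fit_simple_ols are all hypothetical.

```python
# Hypothetical census-tract data (illustrative values only).
pct_heads_25_34 = [10.0, 20.0, 30.0, 15.0, 25.0]       # percent, 0 to 100
sales_per_household = [36.0, 49.0, 66.0, 42.0, 58.0]   # dollars per household


def fit_simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_ols(pct_heads_25_34, sales_per_household)

# The sign of the slope tells us the direction of the relationship.
direction = "positive" if slope > 0 else "negative"
```

With these made-up figures the slope is positive, so tracts with a higher percentage of household heads aged 25-34 are associated with higher per-household sales.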
Categorical predictors (such as head of household gender or marital
status) can be converted to dummy variables (see discussion of dummy variable
coding in the next section), in which case each dummy variable represents
a single category of the original categorical variable, and each one gets
its own individual coefficient.
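
As a rough illustration of dummy variable coding, the sketch below turns one categorical value into a set of 0/1 indicators. The category list, the is_ prefix and the function name dummy_code are assumptions made for this example, as is the common practice of omitting one reference category.

```python
# Hypothetical categories for a marital status variable.
MARITAL_CATEGORIES = ["single", "married", "divorced", "widowed"]


def dummy_code(value, categories, reference=None):
    """Return 0/1 indicators, one per category, omitting the reference."""
    return {
        f"is_{category}": int(value == category)
        for category in categories
        if category != reference
    }


# Each resulting dummy variable would receive its own coefficient.
dummies = dummy_code("married", MARITAL_CATEGORIES, reference="single")
```

Here a "married" household head gets a 1 on is_married and a 0 on the other indicators; a "single" head would get a 0 on all of them.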
Then, when it comes time to score individual households from a list
or customer/prospect file, one simply applies the single coefficient for
"percentage of household heads aged 25-34" to each household
having a household head aged 25-34. This will automatically force the
household up or down in the scoring rank, based on whether a higher percentage
of age 25-34 in the original model was positively or negatively correlated
with the dependent variable. (Obviously, the household-level file must
either already contain or be enhanced with appropriate demographic variables
from a syndicator of household-level data.)
In other words, at the household level, a household head aged 25-34
is obviously "100% aged 25-34," and therefore gets a better
score if higher percentages of aged 25-34 in the original model were good,
or gets a worse score if higher percentages of aged 25-34 were bad. Note
also that, although we could do so, we do not need to multiply the coefficient
by 100 to reflect the fact that the household head is "100% aged
25-34," the way we normally do when applying a regression coefficient
to a case's value on the predictor variable. In this example the
household head is either 100% aged 25-34 or zero percent aged 25-34, and
cannot fall anywhere in between, so the coefficient is simply either added
to the household's score or left out. As long as every percentage-based
predictor is handled the same way, leaving out the factor of 100 only
rescales the scores and does not change how households rank. Similarly,
by default, any household where the head of household is not aged 25-34
would get a zero for this coefficient.
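
The household-level scoring step might look like the following sketch. Everything named here is hypothetical: the coefficient values, the variable names, the household records and the two helper functions are made up for illustration, and the model is assumed to contain only percentage-based predictors.

```python
# Hypothetical coefficients from a tract-level model (illustrative values).
tract_model_coefficients = {
    "pct_heads_aged_25_34": 1.52,
    "pct_income_over_60k": 0.90,
    "pct_heads_aged_65_plus": -0.40,
}

# Hypothetical household records, already enhanced with demographics.
households = [
    {"id": "H1", "head_age": 29, "income": 75000},
    {"id": "H2", "head_age": 70, "income": 40000},
    {"id": "H3", "head_age": 45, "income": 50000},
]


def household_indicators(household):
    """Express each aggregate predictor as a 0/1 flag for one household."""
    return {
        "pct_heads_aged_25_34": int(25 <= household["head_age"] <= 34),
        "pct_income_over_60k": int(household["income"] > 60000),
        "pct_heads_aged_65_plus": int(household["head_age"] >= 65),
    }


def downshift_score(household, coefficients):
    """Add up the coefficients of every characteristic the household has."""
    flags = household_indicators(household)
    return sum(coefficients[name] * flag for name, flag in flags.items())


household_scores = {
    hh["id"]: downshift_score(hh, tract_model_coefficients) for hh in households
}
```

In this sketch H1 picks up both positive coefficients, H2 picks up only the negative one, and H3 matches none of the characteristics and so scores zero.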
The scoring of the household-level data proceeds the same way for
each of the other predictor variables. Once the prospect file has been
properly scored, prospects are then selected for promotion based on some
minimum threshold model score which management has chosen.
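
The selection step can be sketched as a simple filter and sort. The scores, the household identifiers and the cut-off value MIN_SCORE below are all hypothetical; in practice the threshold is whatever management has chosen.

```python
# Hypothetical model scores for households in a prospect file.
household_scores = {"H1": 2.42, "H2": -0.40, "H3": 0.0, "H4": 1.52}

# Hypothetical minimum score chosen by management.
MIN_SCORE = 1.0

# Keep households at or above the threshold, best prospects first.
selected = sorted(
    (hh_id for hh_id, score in household_scores.items() if score >= MIN_SCORE),
    key=lambda hh_id: household_scores[hh_id],
    reverse=True,
)
```

With these made-up scores, only H1 and H4 clear the threshold, in that order.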
IMPORTANT: Geographic down-shifting is a bit trickier than up-shifting.
Specifically, it is important to select a relatively low level of micro-geography
as the unit of analysis (e.g., census tracts in the U.S.), in order to
reduce the chance of an "ecological fallacy." Such a fallacy
can arise when using higher levels of geography (e.g., U.S. zip code,
county, city or state) as the units of observation, because of the existence
of latent variables and the importance of geography (e.g., neighborhood)
as opposed to just demographics.
For example, a model predicting sales potential might indicate that
foreign immigrants are particularly good prospects for a particular retail
product or service. However, it could simply be that immigrants have tended
to settle in areas where market potential is high among the native population;
but the immigrants themselves may not actually have as much market potential
as the aggregate-data model might suggest. In the U.S., many immigrants
tend to settle in the states of New York and California, and specifically
in New York City or Los Angeles. Any model built at the city or state
level of analysis may not give a true picture of immigrants' market potential
for various products or services.
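
A small constructed example may help show how this can happen. The figures below are entirely hypothetical and were chosen only to make the effect visible: within every area immigrants spend less than natives, yet areas with a larger immigrant share still show higher average spending, because native spending happens to be higher there.

```python
# Hypothetical areas (illustrative values only): population share of
# immigrants and average spending per person within each group.
areas = {
    "area_A": {"immigrant_share": 0.05, "native_spend": 100.0, "immigrant_spend": 80.0},
    "area_B": {"immigrant_share": 0.20, "native_spend": 200.0, "immigrant_spend": 120.0},
    "area_C": {"immigrant_share": 0.40, "native_spend": 300.0, "immigrant_spend": 160.0},
}


def area_average_spend(area):
    """Population-weighted average spending per person in one area."""
    share = area["immigrant_share"]
    return (1 - share) * area["native_spend"] + share * area["immigrant_spend"]


# What an aggregate-level model would see: one average per area.
averages = {name: area_average_spend(area) for name, area in areas.items()}

# What is true at the individual level in this constructed data.
immigrants_spend_less_everywhere = all(
    area["immigrant_spend"] < area["native_spend"] for area in areas.values()
)
```

An aggregate model fitted to these three areas would give immigrant share a positive coefficient, even though no individual immigrant in the data is a better prospect than the natives living in the same area.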
So it is important when using geographic down-shifting to use a relatively
low level of geography as the unit of analysis. And even in this case,
it is always important to do a "sanity check" to make sure that
the model makes good intuitive sense. Models used in geographic down-shifting
should be developed only by experienced analysts. And, whenever possible,
such models should be checked against other available sources of information.