SmartDrill

Creative Data Mining


Geographic up-shifting and down-shifting

Geographic up-shifting and down-shifting involve methods that allow us to take a model that was originally created at one level of geographic specificity, and apply that model to either a higher or lower level of geographic specificity:

  • Geographic up-shifting applies a household-level targeting model to micro-geographic-level data
  • Geographic down-shifting applies a micro-geographic-level targeting model to household-level data

Geographic up-shifting: applying a household-level targeting model to micro-geographic-level data

Using advanced data mining techniques, existing survey research data can often perform double duty as a bridge to larger-scale geo-demographic analysis. For example, a retailer often has attitude and usage data, as well as key demographic data, from a recently conducted market research study (e.g., a large-scale market definition study or a large-scale attitude / awareness / usage tracking study). The results of a predictive data mining analysis of these data can be meaningfully projected onto units of micro-geography (e.g., postal zip codes, mail carrier routes or census tracts in the United States) to assist management with retail site selection, direct mail promotion targeting, etc.

Thus, one does not have to pay a geo-demographic syndicator a large fee for geocoding the data, overlaying their proprietary clustering codes, and analyzing the enhanced data. Instead, one can use much less expensive census data (which many retailers already have in-house) in conjunction with proprietary market research data, to achieve powerful results.

Here is a simple example. Let us say that one has conducted a market research survey that includes items measuring customer loyalty, heaviness of spending, or usage of a particular retail department. If the survey also includes standard, key demographic classification questions, then one can use advanced data mining techniques to build a predictive model. The dependent variable could be any of the aforementioned measures (loyalty, spending or usage), and the demographics are the predictors in the model. After generating a satisfactory model, the results of the model can be used to score units of micro-geography, much the same as one would use modeling results to score households (or businesses) in a proprietary customer or prospect database.
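As a minimal sketch of the modeling step, assume a survey with a single demographic predictor (age group) and a loyalty score as the dependent variable. The records, the 0-10 scale and the function name below are hypothetical. For one dummy-coded categorical predictor with a full set of dummies and no intercept, the least-squares coefficients are simply the group means, so no regression library is needed for this toy case; a real model would have several predictors and would be fitted with a statistics package.

```python
from collections import defaultdict

# Hypothetical survey records: (age group, loyalty score on a 0-10 scale).
survey = [
    ("18-34", 4.0), ("18-34", 6.0),
    ("35-54", 7.0), ("35-54", 9.0),
    ("55+", 5.0), ("55+", 5.0),
]


def fit_group_coefficients(records):
    """Least-squares coefficients for one dummy-coded categorical predictor.

    With a full set of dummy variables and no intercept, each coefficient
    equals the mean of the dependent variable within that category.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, y in records:
        totals[group] += y
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


age_coefficients = fit_group_coefficients(survey)
print(age_coefficients)
```

In this made-up data the 35-54 group receives the largest coefficient, which is what drives the scoring of micro-geographic units later on.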

The trick is to translate the demographics from the survey respondent level to the micro-geographic level. Again, to use a simple example, let us say that one has discovered from the survey-based model that particular age groups are more loyal (or are heavier spenders, etc.) than other age groups. Rather than scoring a household-level or business-level file using the various categories of age, one can weight the model coefficients by the proportions of a micro-geographic unit's population falling into each age group.

For example, age groups' coefficients from a regression model can be multiplied by the proportions of a micro-geographic unit's population falling into the respective age groups. If a particular age group has a strong coefficient, and/or it represents a disproportionately large share of the micro-geographic unit's population, then that unit will achieve a higher model score.
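The weighting just described can be sketched in a few lines. The coefficients and tract proportions below are hypothetical, and `upshift_score` is an illustrative name rather than part of any particular package.

```python
# Hypothetical coefficients from a survey-based model, one per age group.
age_coefficients = {"18-34": 5.0, "35-54": 8.0, "55+": 5.0}


def upshift_score(coefficients, proportions):
    """Weight each category coefficient by the unit's population share."""
    return sum(
        coefficients[category] * proportions.get(category, 0.0)
        for category in coefficients
    )


# Hypothetical census proportions for two micro-geographic units.
tract_a = {"18-34": 0.25, "35-54": 0.50, "55+": 0.25}
tract_b = {"18-34": 0.50, "35-54": 0.25, "55+": 0.25}

score_a = upshift_score(age_coefficients, tract_a)
score_b = upshift_score(age_coefficients, tract_b)
print(score_a, score_b)
```

Tract A scores higher than tract B here only because a larger share of its population falls into the age group with the strongest coefficient.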

This scoring procedure proceeds similarly for the various categories of the other demographic variables from the survey research-based model. After all micro-geographic units of interest have been scored with all model parameters, standard tabulation and mapping routines can be used to perform retail siting, promotion targeting, and even store-level merchandise mix planning.
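Extending the idea to several demographic variables, and then ranking the units, might look like the following sketch. The variable names, coefficients, tract identifiers and proportions are all hypothetical; the only assumption carried over from the text is that each unit's score is the sum, across variables, of coefficients weighted by population shares.

```python
# Hypothetical model: category coefficients for each demographic variable.
model = {
    "age": {"18-34": 5.0, "35-54": 8.0, "55+": 5.0},
    "income": {"under_50k": 0.0, "50k_plus": 2.0},
}

# Hypothetical census proportions per unit, using the same category breaks.
units = {
    "tract_001": {
        "age": {"18-34": 0.25, "35-54": 0.50, "55+": 0.25},
        "income": {"under_50k": 0.50, "50k_plus": 0.50},
    },
    "tract_002": {
        "age": {"18-34": 0.50, "35-54": 0.25, "55+": 0.25},
        "income": {"under_50k": 0.25, "50k_plus": 0.75},
    },
    "tract_003": {
        "age": {"18-34": 0.25, "35-54": 0.25, "55+": 0.50},
        "income": {"under_50k": 0.75, "50k_plus": 0.25},
    },
}


def score_unit(model, profile):
    """Sum the proportion-weighted coefficients over every model variable."""
    total = 0.0
    for variable, coefficients in model.items():
        shares = profile[variable]
        total += sum(c * shares.get(cat, 0.0) for cat, c in coefficients.items())
    return total


scores = {name: score_unit(model, profile) for name, profile in units.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The ranked list is the kind of output that would then be passed to tabulation or mapping routines.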

This approach works best if one plans ahead by designing the market research to include demographic items that have the same category breaks that standard micro-geographic census variables have. And if one already has a site license that allows usage of geo-demographic cluster codes from one or more of the popular syndicators, then that is even better. (In the United States, syndicators such as Consumer Infobase, for example, have the Claritas PRIZM cluster codes available for appending to proprietary data files at the household level, and can also furnish census tract codes, etc., for micro-geographic analysis.) The key point is that advanced data mining techniques can significantly improve the process of knowledge discovery and application, whether or not one has a site license for a proprietary geo-demographic and lifestyle targeting system. And the whole process can be based on existing proprietary market research data.
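If the survey collected exact ages rather than pre-coded brackets, one way to line the data up with census categories is to recode it using the census breakpoints. The breakpoints and labels below are hypothetical placeholders; in practice they would be copied from whichever census tables one intends to use.

```python
import bisect

# Hypothetical category breaks (lower bounds of each bracket after the first).
AGE_BREAKS = [25, 35, 45, 55, 65]
AGE_LABELS = ["under 25", "25-34", "35-44", "45-54", "55-64", "65+"]


def census_age_group(age):
    """Map an exact age onto the bracket whose range contains it."""
    return AGE_LABELS[bisect.bisect_right(AGE_BREAKS, age)]


respondent_ages = [24, 25, 34, 35, 64, 65]
recoded = [census_age_group(age) for age in respondent_ages]
print(recoded)
```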

Geographic down-shifting: applying a micro-geographic-level targeting model to household-level data

The previous example showed how the results of household-level or respondent-level predictive models can be geographically "up-shifted" and applied to micro-geographic units for real estate siting, promotion targeting, etc. The next example shows how to do the reverse: "down-shifting" a micro-geographically-based model to apply it to household-level databases. For example, if one has sales data at some low level of micro-geography (e.g., in the United States this might be at the census tract level), then one can generate a predictive sales model and apply the model to household-level data to support a direct mail outreach program or other targeting effort.

In this example, let us say that a manufacturer, distributor or retailer of an upscale gourmet food package has a proprietary database containing unit or dollar sales figures by unit of micro-geography. And suppose that management wishes to target a newsletter or other type of promotional mailing, or perhaps some sort of loyalty-generation program or cross-selling effort, to households that look like good sales prospects for upscale gourmet food products.

One can start with the micro-geographic sales data, and append available census or syndicated demographic data to the units of micro-geography. However, unlike household-level data, these variables will be expressions of aggregate demographic statistics, such as percent of household heads aged 35 to 44, or percent of households with incomes above $60,000, etc. These aggregate statistical variables are used as predictors in a model, and per-capita or per-household sales figures are used as the dependent variable that the model is attempting to predict.
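A rough sketch of this data preparation step follows. The tract records and field names are hypothetical; the point is only that raw counts are converted into a per-household dependent variable and percentage-based predictors before modeling.

```python
# Hypothetical tract-level records: total sales plus appended census counts.
tracts = [
    {"tract": "T1", "sales": 12000.0, "households": 400, "heads_25_34": 100},
    {"tract": "T2", "sales": 9000.0, "households": 600, "heads_25_34": 90},
]


def to_model_row(tract):
    """Convert raw counts into per-household sales and a percentage predictor."""
    households = tract["households"]
    return {
        "tract": tract["tract"],
        "sales_per_household": tract["sales"] / households,
        "pct_heads_25_34": 100.0 * tract["heads_25_34"] / households,
    }


rows = [to_model_row(t) for t in tracts]
for row in rows:
    print(row)
```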

One could use a regression-type modeling technique, for example, to identify demographic characteristics of units of micro-geography associated with varying levels of per-household dollar sales. The coefficients from this model will then be applied to the scoring of household-level prospects. Now let us see how this would actually be accomplished.

During the modeling stage, a single coefficient is generated for each demographic variable. For example, "percentage of household heads aged 25-34" is a continuous (quantitative) variable, ranging in value from zero to 100%, that receives a particular coefficient. Thus, if a higher percentage of household heads aged 25-34 is associated with a higher value on the dependent variable, then the "percentage of household heads aged 25-34" variable gets a positive coefficient of a particular magnitude. If, in contrast, a lower percentage of household heads aged 25-34 is associated with a higher value on the dependent variable, then the "percentage of household heads aged 25-34" variable gets a negative coefficient of some particular magnitude, to represent this inverse correlation.
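To make the sign of the coefficient concrete, here is a minimal sketch that fits an ordinary least-squares line with a single percentage predictor. The four tract values are invented, and a real model would normally include several predictors and be fitted with a statistics package; the closed-form formula is used here only to keep the example self-contained.

```python
def fit_simple_regression(x_values, y_values):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical tract-level data: percentage predictor and per-household sales.
pct_heads_25_34 = [10.0, 20.0, 30.0, 40.0]
sales_per_household = [20.0, 25.0, 30.0, 35.0]

intercept, slope = fit_simple_regression(pct_heads_25_34, sales_per_household)
print(intercept, slope)
```

The slope is positive in this made-up data because tracts with a higher percentage of household heads aged 25-34 also have higher per-household sales; reversing the order of the sales figures would produce a negative slope of the same magnitude.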

Categorical predictors (such as head of household gender or marital status) can be converted to dummy variables (see discussion of dummy variable coding in the next section), in which case each dummy variable represents a single category of the original categorical variable, and each one gets its own individual coefficient.
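Dummy variable coding itself is simple to sketch. The category list and the `is_` naming convention below are hypothetical choices made for illustration.

```python
def dummy_code(value, categories):
    """Return one 0/1 indicator per category of a categorical variable."""
    return {f"is_{category}": int(value == category) for category in categories}


# Hypothetical marital status categories.
MARITAL_STATUS = ["single", "married", "divorced"]

coded = dummy_code("married", MARITAL_STATUS)
print(coded)
```

When a model includes an intercept, one category is usually omitted as the reference level so that the dummy variables are not perfectly collinear with it.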

Then, when it comes time to score individual households from a list or customer/prospect file, one simply applies the single coefficient for "percentage of household heads aged 25-34" to each household having a household head aged 25-34. This will automatically force the household up or down in the scoring rank, based on whether a higher percentage of household heads aged 25-34 in the original model was positively or negatively correlated with the dependent variable. (Obviously, the household-level file must either already contain or be enhanced with appropriate demographic variables from a syndicator of household-level data.)

In other words, at the household level, a household head aged 25-34 is obviously "100% aged 25-34," and therefore gets a better score if higher percentages of aged 25-34 in the original model were good, or gets a worse score if higher percentages of aged 25-34 were bad. Note also that, although we could do so, we do not need to multiply the coefficient by 100 to reflect the fact that the household head is "100% aged 25-34," the way we normally do when applying a regression coefficient to a case's value on the predictor variable. In this example the household head is either 100% aged 25-34 or zero percent aged 25-34, and cannot fall anywhere in between; so, provided all predictors are percentage-based and are treated the same way, omitting the factor of 100 rescales the scores without changing the rank order of households. Similarly, by default, any household where the head of household is not aged 25-34 would get a zero for this coefficient.
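The household-level scoring can be sketched as follows, under the assumption that every predictor in the tract-level model is a percentage and that each household carries a matching 0/1 indicator. The coefficients, household identifiers and the `scale` parameter are hypothetical; `scale` is included only to show that multiplying by 100 leaves the ranking unchanged.

```python
# Hypothetical coefficients from a tract-level model (percentage predictors).
coefficients = {
    "pct_heads_25_34": 0.5,
    "pct_income_over_60k": 0.25,
    "pct_heads_65_plus": -0.25,
}

# Hypothetical households: 1 if the household has the characteristic, else 0.
households = {
    "hh_1": {"pct_heads_25_34": 1, "pct_income_over_60k": 1, "pct_heads_65_plus": 0},
    "hh_2": {"pct_heads_25_34": 1, "pct_income_over_60k": 0, "pct_heads_65_plus": 0},
    "hh_3": {"pct_heads_25_34": 0, "pct_income_over_60k": 0, "pct_heads_65_plus": 1},
    "hh_4": {"pct_heads_25_34": 0, "pct_income_over_60k": 1, "pct_heads_65_plus": 0},
}


def downshift_score(coefficients, indicators, scale=1.0):
    """Add up the coefficients of the characteristics a household has."""
    return sum(
        coefficient * scale * indicators.get(name, 0)
        for name, coefficient in coefficients.items()
    )


def rank(scale):
    scores = {
        hh: downshift_score(coefficients, ind, scale)
        for hh, ind in households.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


rank_unscaled = rank(1.0)    # household treated as "1" on each indicator
rank_scaled = rank(100.0)    # household treated as "100%" on each indicator
print(rank_unscaled, rank_scaled)
```

Both rankings come out in the same order, which is why the factor of 100 can be dropped when only the relative ordering of prospects matters.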

The scoring of the household-level data proceeds the same way for each of the other predictor variables. Once the prospect file has been properly scored, prospects are then selected for promotion based on some minimum threshold model score which management has chosen.
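Selection against a threshold might then look like this short sketch. The scores and the cut-off of 0.5 are hypothetical; in practice the threshold is whatever value management has chosen.

```python
# Hypothetical household scores produced by a down-shifted model.
scores = {"hh_1": 0.75, "hh_2": 0.5, "hh_3": -0.25, "hh_4": 0.25}


def select_prospects(scores, threshold):
    """Keep households at or above the threshold, best prospects first."""
    chosen = [hh for hh, score in scores.items() if score >= threshold]
    return sorted(chosen, key=lambda hh: scores[hh], reverse=True)


mailing_list = select_prospects(scores, threshold=0.5)
print(mailing_list)
```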

IMPORTANT: Geographic down-shifting is a bit trickier than up-shifting. Specifically, it is important to select a relatively low level of micro-geography as the unit of analysis (e.g., census tracts in the U.S.), in order to reduce the chance of an "ecological fallacy." Such a fallacy can arise when using higher levels of geography (e.g., U.S. zip code, county, city or state) as the units of observation, because of the existence of latent variables and the importance of geography (e.g., neighborhood) as opposed to just demographics.

For example, a model predicting sales potential might indicate that foreign immigrants are particularly good prospects for a particular retail product or service. However, it could simply be that immigrants have tended to settle in areas where market potential is high among the native population; but the immigrants themselves may not actually have as much market potential as the aggregate-data model might suggest. In the U.S., many immigrants tend to settle in the states of New York and California, and specifically in New York City or Los Angeles. Any model built at the city or state level of analysis may not give a true picture of immigrants' market potential for various products or services.
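A toy illustration of how this can happen is sketched below. Every number is invented purely to show the mechanism and says nothing about any real population: in the made-up data, the area with the larger immigrant share also has higher average spending, yet individual immigrants spend less than natives in both areas.

```python
# Hypothetical individuals: (area, is_immigrant, spend). Entirely made-up data.
people = (
    [("city_A", True, 20.0)] * 5 + [("city_A", False, 100.0)] * 5
    + [("city_B", True, 20.0)] * 1 + [("city_B", False, 40.0)] * 9
)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def area_summary(area):
    """Return (percent immigrant, average spend) for one area."""
    rows = [p for p in people if p[0] == area]
    pct_immigrant = 100.0 * mean(1.0 if p[1] else 0.0 for p in rows)
    avg_spend = mean(p[2] for p in rows)
    return pct_immigrant, avg_spend


summary_a = area_summary("city_A")
summary_b = area_summary("city_B")

# Individual-level comparison across both areas.
immigrant_mean = mean(p[2] for p in people if p[1])
native_mean = mean(p[2] for p in people if not p[1])

print(summary_a, summary_b)
print(immigrant_mean, native_mean)
```

An aggregate model built on these two areas would give "percent immigrant" a positive coefficient, and down-shifting that coefficient would rank immigrant households above native ones, which is the opposite of what the individual-level figures show.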

So it is important when using geographic down-shifting to use a relatively low level of geography as the unit of analysis. And even in this case, it is always important to do a "sanity check" to make sure that the model makes good intuitive sense. Models used in geographic down-shifting should be developed only by experienced analysts. And, whenever possible, such models should be checked against other available sources of information.



Copyright © 1998-2008 SmartDrill. All rights reserved.