Local Smoothing: a Method of Controlling Error and Estimating Relationships in Consumer Research

Joel Huber, Purdue University
ABSTRACT - Local smoothing works on the assumption that an undefined but continuous function exists between a criterion variable and a set of predictor variables. It estimates the value of each point as a weighted average of points defined as "close" by the predictor variables. It is shown that such a routine has many uses in coping with the high levels of error typically found in consumer research.
[ to cite ]:
Joel Huber (1977) ,"Local Smoothing: a Method of Controlling Error and Estimating Relationships in Consumer Research", in NA - Advances in Consumer Research Volume 04, eds. William D. Perreault, Jr., Atlanta, GA : Association for Consumer Research, Pages: 24-28.

Advances in Consumer Research Volume 4, 1977    Pages 24-28


THE NEED FOR SMOOTHING

In consumer behavior, there is often a need to estimate or display relationships between sets of very noisy data. Grouping or clustering is sometimes used to emphasize a relationship by minimizing the effect of nuisance variables. The problem is that grouping is a tremendously ad hoc procedure. It has been shown by Blalock (1964, p. 33) and others that conclusions as to the strength of a relationship can depend critically on the cut points. Therefore, those researchers concerned with reproducibility of research are understandably reluctant to use a grouping or clustering scheme that requires relatively arbitrary judgments at the time of analysis.

Smoothing estimates the value of a criterion as an average of other observations defined as "close" with respect to one or a set of predictor variables. While it functions to reduce noise in the same way as clustering, it differs from it in that the values of the predictor variables do not change but merely serve as the basis for deciding the relative weighting of the criterion. Various forms of smoothing have been used by researchers. Exponential smoothing of time series data estimates the value of future events by weighting most heavily those events closest in time to the event to be predicted. Geographic smoothing predicts the height or some other descriptor of land as a weighted average of those values which are geographically proximate. (See Brown, 1963 and Shepard, 1970 for descriptions of these methodologies.) Rather than use spatial or temporal measures as the predictor variable, this paper considers the value of using any continuous variable to serve as the basis for smoothing. Thus, for example, a person's purchases would be estimated by averaging over other subjects with similar beliefs, past behavior and demographics. Arbitrariness is avoided by choosing the weighting scheme that best predicts each point in the space. It is argued that such a methodology is very useful to those in consumer research in dealing with data that contain high levels of error, in suggesting functional forms for further analysis, and in estimating the strength of relationship between predictor and criterion variables.

The central assumption of any smoothing routine is that values of the criterion variables are generated by some differentiable function of the predictor variables plus random error. That is:

y_i = f(x_i) + e_i   (1)

where:

y_i = the value of the criterion variable on the ith observation

x_i = a vector of predictor variables on the ith observation

f = a smooth function relating y to x

e_i = random error, with E(e_i) = 0 and E(e_i e_j) = 0 for i ≠ j

The object of smoothing is to estimate the expected value of the criterion given the vector of predictors, or E(y|x). This is very similar to regression except that no specific functional form, such as linearity, is assumed. What is assumed is that the function is smooth and differentiable about each point in the relevant range. Smoothing estimates the value of the function for observation i as the average over those y_j values whose x_j is close to x_i. Taking the Taylor series expansion of Equation 1 about observation j,

f(x_i) = f(x_j) + f'(x_j)(x_i - x_j) + f''(x_j)(x_i - x_j)^2/2 + higher-order terms in (x_i - x_j).   (2)

Thus, given that such a function exists, for small values of x_i - x_j the error from truncating the Taylor series expansion is relatively insignificant, so that f(x_j) is close to f(x_i). The gain from estimating the function as the average of proximate points derives from the canceling effect of averaging the uncorrelated errors. The problem, then, is to find the relative weights to be given to observations in estimating nearby observations. Too much weight on points that are very close means that there will be few observations over which to smooth disturbances. On the other hand, a very broad net provides plenty of points to minimize the effect of random error, but at the cost of substantial deviation from the Taylor series approximation. The optimal weighting obviously depends on the characteristics of a data set. In particular, it depends on the number of observations; the more points available for smoothing, the less distant points are needed. The optimal weighting also depends on the absolute value of the higher order derivatives of the function; the greater these are, the greater the need to put the heaviest weight close to the point being estimated.
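The trade-off described above can be written out as a short derivation. This is a sketch under assumptions that go beyond the text: a single predictor x, weights w_ij treated as fixed, and a common error variance sigma^2 (the paper assumes only zero-mean, uncorrelated errors).

```latex
\hat{y}_i = \frac{\sum_{j} w_{ij}\, y_j}{\sum_{j} w_{ij}}, \qquad y_j = f(x_j) + e_j

% Bias: grows with the spread of the points that receive weight
\mathrm{E}[\hat{y}_i] - f(x_i)
  = \frac{\sum_{j} w_{ij}\,\bigl(f(x_j) - f(x_i)\bigr)}{\sum_{j} w_{ij}}
  \approx \frac{\sum_{j} w_{ij}\,\bigl[f'(x_i)(x_j - x_i) + \tfrac{1}{2} f''(x_i)(x_j - x_i)^2\bigr]}{\sum_{j} w_{ij}}

% Variance: shrinks as weight is spread over more points
\mathrm{Var}[\hat{y}_i] = \sigma^2\, \frac{\sum_{j} w_{ij}^2}{\bigl(\sum_{j} w_{ij}\bigr)^2}
  \quad\Bigl(= \sigma^2/m \text{ for equal weights on } m \text{ points}\Bigr)
```

Concentrating the weights on very close points keeps the bias term small but leaves few effective observations to average over; spreading them widely does the reverse.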

This paper presents L-SMOOTH as an algorithm to determine the optimal weighting scheme for locally smoothing any given data set. Emphasis in this paper, however, is not so much on the particular algorithm as on the concept of local smoothing as it applies to the kinds of data used in consumer research. In the next sections, the algorithm is described and examples are provided of its use. Then a general section follows discussing the values and pitfalls of any such smoothing scheme.

L-SMOOTH: A GENERAL SMOOTHING PROGRAM

L-SMOOTH estimates the criterion value of each observation, y_i, as a weighted average of the y_j values at other points. The weight, w_ij, on any point j used to predict i is inversely related to the standardized distance between i and j on each predictor dimension k. Thus:

EQUATION   (3)

where:

EQUATION   (4)

EQUATION   (5)

The routine determines the best values of a and a weight for each dimension, b_k, such that

EQUATION   (6)

is a minimum. This formulation does not have an analytic solution, so a search routine is used. L-SMOOTH uses the Hooke and Jeeves (1961) pattern search as implemented in the SDRMIN algorithm of Buffa and Taubert (1972). This search routine is quite flexible and efficient.
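Since Equations 3 through 6 did not survive in this copy, the following Python sketch shows one reading of them that is consistent with the verbal description: a leave-one-out weighted average, weights equal to the inverse of a weighted squared standardized distance plus a, and a mean squared residual as the quantity minimized. The exact functional forms, the function names, the normalization b_1 = 1, and the simple multiplicative coordinate search are all assumptions for illustration; the search is a stand-in and is not the Hooke and Jeeves routine or SDRMIN.

```python
# Sketch of a leave-one-out local smoother in the spirit of L-SMOOTH.
# Assumed forms (the source equations are missing, so these are guesses):
#   D2_ij  = sum_k b_k * ((x_ik - x_jk) / s_k)**2
#   w_ij   = 1 / (D2_ij + a),  a > 0
#   yhat_i = sum_{j != i} w_ij * y_j / sum_{j != i} w_ij
#   var(e) = mean_i (y_i - yhat_i)**2
from statistics import pstdev


def smooth(X, y, a, b):
    """Return leave-one-out smoothed estimates of y."""
    n, p = len(X), len(X[0])
    # Standard deviation of each predictor; a constant column is left unscaled.
    s = [pstdev([row[k] for row in X]) or 1.0 for k in range(p)]
    estimates = []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j == i:
                continue  # a point is never used to predict itself
            d2 = sum(b[k] * ((X[i][k] - X[j][k]) / s[k]) ** 2 for k in range(p))
            w = 1.0 / (d2 + a)
            num += w * y[j]
            den += w
        estimates.append(num / den)
    return estimates


def error_variance(X, y, a, b):
    """Mean squared difference between y and its smoothed estimate."""
    estimates = smooth(X, y, a, b)
    return sum((yi - ei) ** 2 for yi, ei in zip(y, estimates)) / len(y)


def coordinate_search(X, y, a=1.0, b=None, step=0.5, tol=1e-3, max_passes=200):
    """Crude multiplicative coordinate search over a and b_2..b_p.

    b_1 is held fixed because scaling a and every b_k by the same factor
    leaves the weighted average unchanged (an assumed normalization).
    """
    p = len(X[0])
    params = [a] + (list(b) if b is not None else [1.0] * p)
    best = error_variance(X, y, params[0], params[1:])
    for _ in range(max_passes):
        if step <= tol:
            break
        improved = False
        for idx in range(len(params)):
            if idx == 1:
                continue  # skip b_1
            for factor in (1.0 + step, 1.0 / (1.0 + step)):
                trial = params[:]
                trial[idx] *= factor  # multiplicative moves keep values positive
                value = error_variance(X, y, trial[0], trial[1:])
                if value < best - 1e-12:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return params[0], params[1:], best
```

Calling `coordinate_search(X, y)` on a list of predictor rows `X` and a list of criterion values `y` returns the fitted a, the list of b_k, and the minimized error variance. Because each point is excluded from its own estimate, the error variance is an out-of-sample measure, which is what keeps the search from collapsing onto a = 0.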

The weighting function requires some explanation. Notice that each point is weighted as the inverse of the sum of its squared standardized distance from the target and a constant, a (alpha). Here a is constrained to be strictly greater than zero so that finite weights are given to all points, including those that coincide with the target. The smaller a, the greater the relative weight of observations closest to the target. This is illustrated in Figure 1. Notice that decreasing a moves the hyperbola to the right and means that the closest observations are given proportionately more weight. A very large a would provide almost equal weight to points near and far from the target.
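The effect of a can be checked numerically. The short Python sketch below assumes the weight has the form 1/(D^2 + a), which matches the verbal description; the function names are hypothetical. It prints the weight at several standardized distances relative to the weight at distance zero.

```python
def weight(d2, a):
    """Assumed weight: inverse of (squared standardized distance + a)."""
    if a <= 0:
        raise ValueError("a must be strictly positive")
    return 1.0 / (d2 + a)


def relative_weight(distance, a):
    """Weight at a given standardized distance relative to the weight at distance 0."""
    return weight(distance ** 2, a) / weight(0.0, a)


# Small a concentrates weight near the target; large a flattens the profile.
for a in (0.1, 1.0, 10.0):
    row = [round(relative_weight(d, a), 3) for d in (0.0, 0.5, 1.0, 2.0)]
    print(a, row)
```

With a = 1, a point one standard deviation away receives half the weight of a coincident point; with a = 10 it receives about 91 percent, and with a = 0.1 only about 9 percent.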

FIGURE 1

THE EFFECT OF a ON THE RELATIVE WEIGHT GIVEN TO OBSERVATIONS AS A FUNCTION OF DISTANCE

Each predictor dimension also has a weight, b_k. Since the predictor dimensions have been standardized to have equal variances, the b_k are ratio measures of the relative stability of each dimension in predicting the criterion. Thus, b_k is analogous to the standardized beta weight in linear regression as an estimate of the relative importance of predictor variables. It also shares with the beta weight a sensitivity to high multicollinearity among the predictor variables. This makes sense; if two predictors are highly similar, it will not make much difference in prediction whether one is used, or the other, or any combination of the two.

In addition to estimates of a and b_k, L-SMOOTH provides a measure of fit highly analogous to R^2 in regression. It is simply

R^2_smooth = 1 - var(e)/var(y),   (7)

where var(e) is defined in Equation 6 and var(y) is the variance of the criterion. Thus, R^2_smooth measures the percent of the original criterion variance that is accounted for by the smoothing. Since the smoothing function is not limited in its functional form, this measure provides an asymptotic upper bound to the fit of any parametric form. This is clearly seen if the number of observations is very large. Then, the smoothing estimate of E(y|x) is simply the average of the y values at a given x value. Obviously, no parametric form will be able to do better than smoothing in the asymptotic case.
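Equation 7 is straightforward to compute once smoothed estimates are available. The following Python sketch assumes that var(e) is the mean squared difference between each observation and its smoothed estimate; the function name and the sample numbers are hypothetical.

```python
def r2_smooth(y, estimates):
    """Compute 1 - var(e)/var(y), with var(e) taken as the mean squared residual."""
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    var_e = sum((v - e) ** 2 for v, e in zip(y, estimates)) / n
    return 1.0 - var_e / var_y


# Hypothetical criterion values and smoothed estimates.
y = [1.0, 2.0, 3.0, 4.0]
estimates = [1.5, 2.0, 3.0, 3.5]
print(r2_smooth(y, estimates))
```

If the estimates are formed without using each point's own value, this measure can fall below zero when the smoothing predicts worse than the overall mean would.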

In the case where the number of observations is not arbitrarily large, a parametric form may have a higher R^2 than smoothing for two reasons. First, the large number of parameters in, say, a quadratic regression may produce a high R^2 simply because of the many degrees of freedom used. Second, smoothing assumes that the same weighting scheme is appropriate throughout the range of the data, so that a weighting scheme might produce large errors in certain regions which a parametric model might avoid. Fortunately, the work done in exponential smoothing of time series (Winters, 1960) indicates that average error is relatively insensitive to the exact smoothing constant used. Therefore, given that the data are not unduly perverse, it is fairly safe to assert that one can expect little improvement in accuracy moving from a smoothing solution to a parametric one.

Notice that the claim is not being made that a smoothing model is better than a parametric one such as linear regression. Indeed, as a model, it is somewhat worse. Smoothing cannot predict a criterion from a knowledge of the predictor variables alone; it requires the values of other criterion observations with weights defined by the predictors. By contrast, a regression model provides a compact summary of the functional form which allows direct prediction from any set of predictor values. Therefore, a routine such as L-SMOOTH should be used to suggest functional forms and to lessen the effects of high levels of error, but not as a model in itself.

Examples of Smoothing

The first example of local smoothing involves the area where the technique is perhaps of most value to those in consumer research -- dealing with probabilities. The following data set represents a hypothetical set which was used in the initial testing of L-SMOOTH. It illustrates both the advantages and disadvantages of smoothing.

Imagine the following set of data. Twenty consumers are asked to evaluate a product they have just purchased for the first time. The only subsequent data is whether the product was repurchased in the next shopping trip. Smoothing attempts to estimate the relationship between evaluating the product and the probability of subsequent repurchase. The results of the L-SMOOTH analysis are given in Figure 2. Conceptually, what the routine has done is find that weighting of adjacent points that best predicts each point. The least squares solution to this problem is given by the dotted line in Figure 2. The optimal weighting is

w_ij = 1/(D_ij^2 + .49)

where D_ij^2 is the squared standardized distance between observation j and the point i to be estimated. This weighting scheme produces an improvement in variance accounted for of about 50%. The standard error about each estimate is about .35.
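As a worked illustration of what a = .49 implies, the weights at a few standardized distances follow directly, reading the weight as the inverse of the squared distance plus .49 (the reading under which all weights stay finite). The distances chosen here are for illustration only.

```latex
w(D = 0) = \frac{1}{0 + 0.49} \approx 2.04, \qquad
w(D = 1) = \frac{1}{1 + 0.49} \approx 0.67, \qquad
w(D = 2) = \frac{1}{4 + 0.49} \approx 0.22

\frac{w(0)}{w(1)} = \frac{1.49}{0.49} \approx 3.0, \qquad
\frac{w(0)}{w(2)} = \frac{4.49}{0.49} \approx 9.2
```

An observation at the same evaluation as the target thus counts about three times as much as one a standard deviation away, and about nine times as much as one two standard deviations away.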

There are several things to notice about the result. Except for the part of the curve in the lower left hand corner, the curve takes a normal sigmoid shape. On examination, the "hump" in the lower left hand corner can be seen to be due to the one subject who rated it poorly but still repurchased. While it might be tempting to remove this outlier, the data as presented provide no credible way to discriminate between a logistic and, say, a linear representation of the data. Like any model that uses a squared error criterion, smoothing is strongly affected by outliers, particularly at the edges of the data set and in regions where the density of observations is low. The issue of outliers is raised again later in this paper. Fortunately their effect becomes minimal as the number of observations increases.

FIGURE 2

SMOOTHING HYPOTHETICAL 0-1 DATA: PROBABILITY OF REPURCHASE V. EVALUATION AFTER TRIAL

An example of smoothing actual data is provided in Figure 3. In this case, data represent fifty tipping occasions at a local restaurant. The dollar tip is smoothed with respect to the total cost of the meal including drinks. The smoothed function represents a good fit with R^2_smooth = 0.79. The shape of the smoothed function indicates that a linear approximation should work well with these data and further that the intercept is greater than zero. Both of these hypotheses were verified by linear regression. This illustrates how locally smoothing one variable on another can provide a good source of hypotheses for future analysis.

Using the same data set, age and the number of people at the table were used to smooth tip paid as a percent of the total bill. The results provided the following information:

TABLE

In this case, both age and the number of people at the table were ordered category ratings. That is, the number of people is an integer, and the age of the patron was estimated and rounded to the nearest 5. Thus, there were relatively large numbers of observations for each value of the predictor variables. This is reflected in the low value of a, which implies a very heavy weight on observations in the same category compared to those in nearby categories. On the other hand, the R^2_smooth is not very high, indicating that although these smoothing variables have some effect on the percent tip, it is not very great. Finally, the age category is given more than three times the weight of the number of people at the table in predicting tip. Thus, age predicts far more variation in the percent tip than does the number at the table.

FIGURE 3

DOLLAR TIP SMOOTHED BY TOTAL BILL

The importance of age contradicted a linear analysis of the same data. Therefore, age alone was used to smooth percent tip. The results are provided in Figure 4. Notice the strong nonlinearities, with peaks of tipping at ages 30 and 50 and lower tipping elsewhere. In this case, local smoothing provided a good way to understand what is obviously a rather complex phenomenon, and in particular, why a linear function understated the importance of age.

DISCUSSION

In consumer behavior, a smoothing routine such as L-SMOOTH can be used in three ways. It can be used as a method of analysis in its own right. It can be used as a means of conducting preliminary research by suggesting hypotheses and functional forms for more rigorous analysis. And, finally, it can be used as a means of preparing data for later analysis by reducing the effect of errors and especially outliers. This section discusses the value of smoothing in these contexts and some of the problems or pitfalls that might arise.

As a tool of analysis in its own right, smoothing provides a good description of the data under study. Its global measure of fit, R^2_smooth, measures the gain in variance accounted for by using points close to a given point to estimate that point. This generally provides an upper bound on the amount of variance that could be accounted for by any parametric form. Secondly, the b_k values for each dimension are ratio measures of the relative effectiveness of each dimension in predicting the criterion.

FIGURE 4

PERCENT TIP SMOOTHED BY AGE OF PATRON

Currently the routine lacks measures of statistical significance, which severely limits the use of smoothing as a terminal method of analysis.

There are, however, several aspects to smoothing that make it ideal for preliminary estimation of relationships between sets of variables. Since smoothing is not limited to any parametric form, the resulting visual display can suggest various transformations of the data. For example, in dealing with binary or multinomial criterion data, smoothing against the predictor variables may indicate whether a logistic transform is going to be needed or whether a linear approximation is adequate within the range of the data. In a case where smoothing might have been useful, Bettman (1974) found that probit analysis produced results identical to those of regression with a 0-1 criterion. This result might have been predicted by smoothing the data and noting that the relationship is approximately linear in the appropriate range.

When smoothing is used to suggest the direction of further analysis, the raw rather than the smoothed data should serve as input to the later analysis. Smoothed data produces an artificially high index of fit since much of the variance about each point has been removed. Furthermore, the parametric routines generally have well established measures of significance that would be overstated with smoothed data.

In using smoothing to suggest relationships, caution must be taken to avoid distortion from outliers. One large outlier will produce a continuous hump or valley in the smoothed solution. Since it is continuous, there is a temptation to infer that it is "really there." Furthermore, when the raw data are placed in a parametric routine that models the hump, that routine will indeed predict better than those that do not, in that it can account for the outlier. Therefore, the smoothing routine should be used to identify outliers and possibly remove them from the analysis. This is easily done by examining the residuals from the smoothing solution. An outlier should form a pattern of one large residual surrounded by smaller errors of the opposite sign. Outliers can be tentatively identified by looking for residuals whose absolute values are more than three times the standard error of estimate. The analysis can then be rerun with these values deleted.
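The three-standard-error screen described above can be sketched in a few lines of Python. The sketch assumes the standard error of estimate is the root mean squared residual from the smoothing solution; the function name and the sample data are hypothetical.

```python
from math import sqrt


def flag_outliers(y, estimates, threshold=3.0):
    """Return indices whose absolute residual exceeds threshold times the standard error."""
    residuals = [v - e for v, e in zip(y, estimates)]
    # Standard error of estimate, taken here as the root mean squared residual.
    se = sqrt(sum(r * r for r in residuals) / len(residuals))
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * se]


# Hypothetical data: twenty observations, one far from its smoothed estimate.
y = [0.0] * 20
y[7] = 10.0
estimates = [0.1] * 20
print(flag_outliers(y, estimates))
```

Because the outlier itself inflates the standard error, a single residual can never exceed the square root of n times the root mean squared residual, so in samples of nine or fewer observations this particular screen cannot flag anything. In small data sets the residual pattern described above is the more useful guide.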

Because of its effect on significance tests, one would not generally input smoothed data into an analysis such as regression. There may, however, be some benefits from such a move. First, where one's concern is estimation of parameters rather than their statistical significance, presmoothing the dependent variable would be expected to stabilize the estimates of the coefficients and be less affected by outliers. Thus, presmoothing should be used where the primary focus is on robust parameter estimation. Second, when working with binary or multinomial data, smoothing helps remove the sampling error to enable the researcher to concentrate on the functional form of the relationship. Transformations such as the arcsine or the logistic can be applied directly to the smoothed data since mathematically the smoothed probability estimates are constrained to be greater than zero and less than one.

Presmoothing data prior to subsequent analysis can cause problems of interpretation if caution is not used. The effect of any smoothing or grouping by the predictor variable is to heighten the relationship between criterion and predictor while at the same time reducing the relationship between criterion and other variables. Blalock (1964, p. 105) provides an excellent discussion of this problem with respect to different ways to group variables prior to linear analysis. Suppose, for example, one is concerned with the effect of the age of a customer and his distance from a store, on purchases from that store. By presmoothing with respect to location, different age groups are averaged and their effect diminished. A subsequent analysis of age against location-smoothed purchases would result in lower correlation than against the raw data. Thus, smoothing with respect to one set of predictor variables generally increases the strength of the relationship with that set and decreases it with other variables.

The lesson here is that if smoothed or clustered data is to be used in a subsequent analysis, all of the attributes that will go into the final model should be included in the smoothing model. Nakanishi (1976) clustered purchase probabilities by region and then found that ease of travel (distance) is the most important variable in the selection of a store. He also found that the linear coefficients of some of the other attributes, such as perceived quality, were non-significant or had counter-intuitive signs. What happened is that by clustering regionally, the quality ratings were averaged by location so that their effect became lessened or distorted. Thus, smoothing can be used to highlight certain relationships, but at a cost of lessening other relationships.

In summary, local smoothing represents a generalization of the concepts of exponential smoothing of time series and spatial smoothing of geographic data to the needs of those working in consumer research. As a tool, smoothing provides substantive data in its own right; it can suggest functional relationships and identify outliers in a data set to prepare the way for further analysis; and finally, it can, with certain interpretive caveats, be used to prepare data for subsequent analysis. No particular claim is made in this paper for the virtues of L-SMOOTH. Other routines will undoubtedly be developed which will be more efficient and have more useful output. The claim here is that smoothing has not been much used in consumer research, and we are the worse for it.

REFERENCES

James R. Bettman, "A Threshold Model of Attribute Satisfaction Decisions," Journal of Consumer Research, 1 (September 1974), 30-35.

Hubert M. Blalock, Causal Inference in Nonexperimental Research (New York City: Norton, 1964).

Robert G. Brown, Smoothing, Forecasting and Prediction of Discrete Time Series (New Jersey: Prentice Hall, 1963).

E. S. Buffa and W. H. Taubert, Production-Inventory Systems: Planning and Control (Homewood, Illinois: Irwin, 1972).

R. Hooke and T. A. Jeeves, "Direct Search Solution of Numerical and Statistical Problems," Journal of the Association for Computing Machinery, (April 1961).

Masao Nakanishi, "Attitudinal Influence on Retail Patronage Behavior," Advances in Consumer Research, V.III, Beverlee B. Anderson, Editor, Association for Consumer Research, (1976).

Donald S. Shepard, "SYMAP Interpolation Characteristics," Computer Mapping as an Aid in Air Pollution Studies, Report L, Lab for Computer Graphics, Harvard University, Cambridge, Massachusetts, 2 (1970).

Peter R. Winters, "Forecasting Sales by Exponentially Weighted Moving Averages," Management Science, 6(1960), 324-342.
