On the Use of Formulas of the Predictive Validity of Regression in Consumer Research

Philippe Cattin, University of Connecticut
ABSTRACT - A frequent measure of the predictive validity of a regression model is the crossvalidated correlation. Estimators of the population crossvalidated correlation can be used. A few such estimators can be found in the psychology literature. They are reviewed. The advantage of these estimators (over a sample crossvalidated correlation) is that they produce more precise estimates. An example of the use of these estimators in consumer research is presented.
Citation: Philippe Cattin (1979), "On the Use of Formulas of the Predictive Validity of Regression in Consumer Research," in Advances in Consumer Research Volume 06, ed. William L. Wilkie, Ann Arbor, MI: Association for Consumer Research, Pages: 284-287.

Advances in Consumer Research Volume 6, 1979      Pages 284-287

INTRODUCTION

In the social sciences in general (and in consumer research in particular) it is often valuable to measure the predictive validity of a regression model. In most instances, one is interested in predicting the Y-value of an object compared to other objects (e.g. a consumer's utility for a product or for a concept) rather than the absolute Y-value of an object. Hence, relative prediction is what matters (rather than absolute prediction). An appropriate measure of predictive validity is the crossvalidated correlation (rather than the mean squared error of prediction). It can be estimated by splitting the available observations into an estimation sample and a validation (or holdout) sample (and computing the Pearson correlation between the actual Y-values of the objects in the validation sample with the Y-values predicted with the regression parameters estimated in the estimation sample). The resulting measure is a sample crossvalidated correlation (e.g. Goldberg, 1971; Scott and Wright, 1976). However, there are also estimators of the population crossvalidated correlation. These estimators are not well known. The purpose of this paper is to review them, to show their advantage over a sample crossvalidated correlation and to illustrate their use in consumer research.
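The split-sample procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the paper: the function names (`ols_fit`, `predict`, `pearson_r`, `sample_crossvalidated_r`), the random split and the `seed` argument are all hypothetical choices made for the example, and only the standard library is used.

```python
import math
import random


def ols_fit(X, y):
    """Return [intercept, b1, ..., bp] by solving the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coef, X):
    """Predicted Y-values from estimated intercept and slopes."""
    return [coef[0] + sum(b * v for b, v in zip(coef[1:], x)) for x in X]


def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / math.sqrt(va * vb)


def sample_crossvalidated_r(X, y, n_estimation, seed=0):
    """Estimate on a random subsample, correlate actual vs. predicted Y on the rest."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    est, val = idx[:n_estimation], idx[n_estimation:]
    coef = ols_fit([X[i] for i in est], [y[i] for i in est])
    y_hat = predict(coef, [X[i] for i in val])
    return pearson_r([y[i] for i in val], y_hat)
```

Because only the validation observations enter the correlation and only the estimation observations enter the fit, each half of the data is used for one purpose only.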

ESTIMATORS OF THE POPULATION CROSSVALIDATED CORRELATION

Let $Y_i$ be an observation on a criterion variable, $X_{ij}$ an observation on one of $p$ predictor variables ($j = 1, \ldots, p$), $a$ and $b_j$ population parameters, and let:

$$Y_i = a + \sum_{j=1}^{p} b_j X_{ij} + e_i \tag{1}$$

be a regression model where $e_i$ is the disturbance associated with observation $i$. The population parameters are usually unknown and can be estimated with $N$ observations. If $E(e_i) = 0$ ($i = 1, \ldots, N$), and if $E(e_i^2) = \sigma^2$ and $E(e_i e_j) = 0$ for $i \neq j$ ($i, j = 1, \ldots, N$), the Ordinary Least Squares (OLS) estimator of (1) is the Best Linear Unbiased Estimator (BLUE).

There are three mean squared errors that must be distinguished (Darlington, 1968, p. 173). By the same token, there are three correlations: (a) the sample correlation, (b) the correlation produced in the population by the true population weights (which we shall call population correlation), and (c) the correlation produced in the population by the (regression) estimated weights (which we shall call population crossvalidated correlation). The squared sample correlation is:

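$$R^2 = 1 - \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$

where $\bar{Y}$ is the mean of the observations on $Y$ and $\hat{Y}_i$ is the regression estimate of observation $i$. The most common estimator of the squared population correlation is attributed to Wherry (1931):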

$$\hat{\rho}^2 = 1 - \frac{N-1}{N-p-1}\,(1 - R^2) \tag{2}$$

This is not an unbiased estimator. However, Montgomery and Morrison (1973) have shown analytically that the maximum bias of (2) is only about .1/N.
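As a small illustration, the squared sample correlation and the estimator attributed to Wherry can be computed as follows. The sketch assumes that (2) takes the familiar adjusted form 1 - (N-1)(1-R²)/(N-p-1); the function names are hypothetical.

```python
def squared_sample_correlation(y, y_hat):
    """R^2 = 1 - SSE / SST for observed y and regression estimates y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


def wherry_adjusted(r2, n, p):
    """Assumed form of (2): estimate of the squared population correlation.

    r2 is the squared sample correlation, n the number of observations
    and p the number of predictor variables (intercept not counted).
    """
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)
```

The adjustment is always downward when p is at least 1, and it can produce a negative value when R² is small relative to p/(N-1).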

Several estimators of the population crossvalidated correlation have been proposed. Recently, Schmitt, Coyle and Rauschenberger (1977) conducted a Monte Carlo study to compare two such estimators. In their simulation design, they varied the population correlation, the average multicollinearity, the number of predictor variables and the number of observations available for estimation. The levels of each of these variables were representative of studies in the social sciences. They assumed that both criterion and predictor variables are random and normally distributed. The two formulas compared by Schmitt et al. were:

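$$\hat{\rho}_c^2 = 1 - \frac{N-1}{N}\cdot\frac{N+p+1}{N-p-1}\,(1 - R^2) \tag{3}$$

$$\hat{\rho}_c^2 = 1 - \frac{N-1}{N-p-1}\cdot\frac{N-2}{N-p-2}\cdot\frac{N+1}{N}\,(1 - R^2) \tag{4}$$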

These formulas were derived from two unbiased estimators of the population mean squared error of prediction, one assuming fixed predictor variables, the other random predictor variables (formulas (13) and (14), respectively, in Darlington, 1968, p. 173-174). However, as pointed out by Darlington, among others, an unbiased estimator of the population mean squared error of prediction cannot be translated into an unbiased estimator of the population crossvalidated correlation. Hence, (3) and (4) are not unbiased. The results obtained by Schmitt et al. indicate that both (3) and (4) underestimate the true population crossvalidated correlation. Moreover, even though the estimations carried out by Schmitt et al. in their simulation assume random predictor variables, (3) seems to produce less biased results than (4) (and (4), not (3), is the formula that is derived from a mean squared error of prediction estimator that assumes random predictor variables). In fact, the average difference between actual and estimated squared population crossvalidated correlations (across all simulation results) is +.0080 with (3) while it is +.0176 with (4) (see Cattin, 1978).

There are at least three other formulas that were carefully derived by Browne (1975), Burket (1964) and Srinivasan (1977). The derivation of Browne's formula assumes random predictor variables while the derivation of Burket's and of Srinivasan's formulas assumes fixed predictor variables. The results given by Schmitt et al. in their article are sufficient to compute estimates of the bias of these formulas. The results (Cattin, 1978) show that the average difference between actual and estimated squared population crossvalidated correlations (across all simulation results) is +.0029 with Browne's formula, +.0015 with Burket's and -.0018 with Srinivasan's (even though Browne's is the only one of the three formulas that, like the estimations carried out by Schmitt et al., assumes random predictor variables). These values are substantially closer to zero than the +.0080 obtained with (3). Browne's, Burket's and Srinivasan's formulas thus seem to be less biased.

Browne's formula is:

EQUATION  (5)

where $\hat{\rho}^2$ is the maximum of zero and (2), and $\hat{\rho}^4$ is $(\hat{\rho}^2)^2 - 2p(1-\hat{\rho}^2)^2/[(N-1)(N-p-1)]$. Browne has shown by Monte Carlo simulation that the bias of his estimator is relatively small even with a small $N/p$ ratio, except for low correlations. Burket's formula is $(\hat{\rho}^2)^2/R^2$, where $\hat{\rho}^2$ is an estimator of the squared population correlation. Replacing $\hat{\rho}^2$ by (2) gives:

$$\hat{\rho}_c^2 = \frac{[(N-1)R^2 - p]^2}{(N-p-1)^2\,R^2} \tag{6}$$

(The estimate obtained with this formula should be set equal to zero when $p$ is greater than $(N-1)R^2$.)
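The substitution that leads from Burket's expression to (6) can be written out step by step. The sketch below assumes that (2) has the usual adjusted form; the symbol $\hat{\rho}_c^2$ for the crossvalidated estimate is notation introduced here for illustration.

```latex
\begin{align*}
\hat{\rho}^2 &= 1 - \frac{(N-1)(1-R^2)}{N-p-1}
              = \frac{(N-p-1) - (N-1) + (N-1)R^2}{N-p-1}
              = \frac{(N-1)R^2 - p}{N-p-1}, \\
\hat{\rho}_c^2 &= \frac{(\hat{\rho}^2)^2}{R^2}
              = \frac{[(N-1)R^2 - p]^2}{(N-p-1)^2\,R^2}.
\end{align*}
```

The numerator of $\hat{\rho}^2$ is negative when $p > (N-1)R^2$; squaring it would hide the sign, which is why the estimate is set to zero in that case.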

Srinivasan (1977, p. 64-65) recently argued that, if the mean squared error of prediction estimator that assumes fixed predictor variables (formula (13) in Darlington, 1968, p. 173) is to be used to define a squared population crossvalidated correlation formula, two degrees of freedom must be subtracted (from N + p + 1). This is because (a) the value of the intercept can be changed and (b) the slopes can be multiplied by any scalar without changing the resulting correlation between the criterion variable and the predictor variables. The resulting squared population crossvalidated correlation formula is:

$$\hat{\rho}_c^2 = 1 - \frac{N-1}{N}\cdot\frac{N+p-1}{N-p-1}\,(1 - R^2) \tag{7}$$

This formula can be rationalized further with the following argument. If there is only one predictor variable (i.e. p = 1), a regression need not be run to get the sample correlation between the two (criterion and predictor) variables. A well-known formula can be used (e.g. formula 10.1.3 in Winkler and Hays, 1975, p. 645). Moreover, the population crossvalidated correlation and the population correlation are equal and can be estimated with (2) (where p = 1). Hence, (2) and (7) estimate the same thing when p = 1, and since (7) actually reduces to (2), formula (7) makes sense. Furthermore, Srinivasan (1977, p. 67-69) has shown that the bias of his formula is relatively small by comparing the values obtained with his formula to those obtained by Schmidt (1970) by simulation. (The estimations carried out by Schmidt assumed random predictor variables.)
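The reduction of (7) to (2) for a single predictor can be checked directly. The lines below assume that (7) is the fixed-predictor formula with N + p - 1 in place of N + p + 1, and that (2) has the usual adjusted form; both forms are stated here as assumptions.

```latex
\begin{align*}
\text{(7) with } p = 1:\quad
1 - \frac{N-1}{N}\cdot\frac{N+1-1}{N-1-1}\,(1-R^2)
  &= 1 - \frac{N-1}{N}\cdot\frac{N}{N-2}\,(1-R^2)
   = 1 - \frac{N-1}{N-2}\,(1-R^2), \\
\text{(2) with } p = 1:\quad
1 - \frac{N-1}{N-1-1}\,(1-R^2)
  &= 1 - \frac{N-1}{N-2}\,(1-R^2).
\end{align*}
```

Without the subtraction of two degrees of freedom, the factor N + p + 1 would leave (N + 2)/N instead of 1 in the first line, and the two expressions would not coincide.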

ADVANTAGE OF (5), (6) AND (7) OVER A SAMPLE CROSSVALIDATED CORRELATION

We have reported that (5), (6) and (7) are (slightly biased) estimators of the squared population crossvalidated correlation of a regression model (but seemingly less biased than (3) and (4)). The advantage of these formulas over a sample crossvalidated correlation is that they do not require that the available observations be split into two samples (estimation and validation). The resulting measure of predictive validity is more precise (even though it is slightly biased). This has been shown by simulation by Schmitt et al. (1977, p. 756-757). Moreover, this makes sense intuitively since (5), (6) or (7) takes all the available information into account at once, while a sample crossvalidated correlation cannot.

CHOOSING AMONG MULTIATTRIBUTE MODELS - AN ILLUSTRATION OF THE USE OF (5), (6) OR (7)

In regression one has to choose the form of the relationship between the criterion variable and any (interval or ratio scaled) predictor variable (e.g. Should it be linear, nonlinear? Should dummy variables be used?) Although there may be a priori reasons for selecting a function, one is often uncertain that the most appropriate function is used. If prediction is what matters, (5), (6) or (7) can be used to find out which of two (or more) potential functions seems to have more predictive validity.

The number of regression parameters corresponding to any predictor variable depends upon the assumed relationship with the criterion variable. If a linear function is assumed there is only one parameter. If a nonlinear function is assumed, there may be one or two (or even more) parameters. If dummy variables are used, the number of parameters is (k-1) where k is the number of levels the predictor variable takes; hence, it can be one, two or more. The sample correlation typically increases with the number of parameters to estimate. However, formula (5), (6) or (7) shows that the shrinkage between sample correlation and crossvalidated correlation increases with the number of parameters. Hence, the predictive validity of a model may or may not increase when an assumed linear function is replaced by (say) a quadratic function or by dummy variables.
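For formula (7), the dependence of the shrinkage on the number of parameters can be made explicit. The derivation below is a sketch that assumes (7) has the form $1 - \frac{N-1}{N}\cdot\frac{N+p-1}{N-p-1}(1-R^2)$; $\hat{\rho}_c^2$ is notation introduced here for the estimate.

```latex
\begin{align*}
R^2 - \hat{\rho}_c^2
  &= (1-R^2)\left[\frac{(N-1)(N+p-1)}{N(N-p-1)} - 1\right] \\
  &= (1-R^2)\,\frac{(N^2 + Np - 2N - p + 1) - (N^2 - Np - N)}{N(N-p-1)} \\
  &= (1-R^2)\,\frac{2Np - N - p + 1}{N(N-p-1)}.
\end{align*}
```

For fixed $N$ and $R^2$, the numerator grows with $p$ while the denominator falls, so the shrinkage increases with $p$. An added parameter therefore improves the estimate only if it raises $R^2$ by enough to offset the larger shrinkage.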

An example will now be used to illustrate the use of formulas (5), (6) and (7). The data were taken out of an article by Green (1973). The predictor variables are the research, the teaching and the institutional contribution of a University Assistant Professor. Teaching and institutional contribution take on three levels: "below average", "average" and "superior". Research takes on the same three levels and "outstanding". This defines (3 x 3 x 4) 36 hypothetical Assistant Professors. Green (1973, Table 1, p. 411) reports the response ratings of a subject in terms of his subjective probability (ranging from 0 to 100%) of recommending each Professor for a tenured faculty position. In a multiattribute context, these response ratings represent the observations on the criterion variable. A number of multiattribute models can be hypothesized depending upon the attribute utility function (including dummy variables) assumed for research, teaching and institutional contribution. Each model can in turn be estimated by regression.

For illustrative purposes let us consider two models: one using dummy variables for each attribute, the other assuming a linear function for each attribute. The dummy variables model has seven parameters, since $\sum_i (k_i - 1) = 7$ (where $k_i$ is the number of levels of attribute $i$).

On the other hand, the linear model has three parameters (one per attribute). Since outstanding, superior, average and below average correspond to the 98, 80, 50 and 20 percentile level respectively (as compared to all academics throughout the U.S. at similar career points in similar areas of specialization), we shall use these values as our observations on the predictor variables to estimate the linear model.
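The two codings of the 36 profiles can be illustrated as follows. This is a sketch only: the attribute keys, the choice of "below average" as the reference level for the dummy variables, and the helper names are assumptions made for the example, and the response ratings from Green's Table 1 are not reproduced here.

```python
from itertools import product

# Attribute levels as described in the text (research has a fourth level).
LEVELS = {
    "research": ["below average", "average", "superior", "outstanding"],
    "teaching": ["below average", "average", "superior"],
    "institutional": ["below average", "average", "superior"],
}

# Percentile values used as predictor observations in the linear model.
PERCENTILE = {"below average": 20, "average": 50, "superior": 80, "outstanding": 98}

# Full factorial design: 4 x 3 x 3 = 36 hypothetical Assistant Professors.
profiles = list(product(*LEVELS.values()))


def dummy_row(profile):
    """(k - 1) indicator columns per attribute; the first level is the reference."""
    row = []
    for attr, level in zip(LEVELS, profile):
        row.extend(1 if level == other else 0 for other in LEVELS[attr][1:])
    return row


def linear_row(profile):
    """One column per attribute, coded by percentile level."""
    return [PERCENTILE[level] for level in profile]


X_dummy = [dummy_row(p) for p in profiles]    # 36 rows, 3 + 2 + 2 = 7 columns
X_linear = [linear_row(p) for p in profiles]  # 36 rows, 3 columns
```

Either matrix, together with the 36 response ratings, would be the input to the regression; the intercept is added at estimation time and is not counted among the seven or three parameters.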

In a first step, each model was estimated by regression using all 36 observations. The squared sample correlation of the dummy variables model is slightly superior: .922 vs. .918 (see Table 1A). However, the estimate of the squared population crossvalidated correlation of the linear model is somewhat higher than the corresponding estimate of the dummy variables model, whether we use (5), (6) or (7) (see Table 1A). Hence, the linear model seems to have more predictive validity. In other words, if we had another set of observations (provided by the judge who produced the 36 observations we used) we are likely to predict their Y-value more accurately with the linear model than with the dummy variables model.
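The comparison can be retraced from the two squared sample correlations quoted above. The sketch below assumes the forms of (2), (6) and (7) given in its comments; Browne's formula (5) is left out because its full expression is not available here, and the function names are hypothetical. It uses only the R² values, N = 36 and the parameter counts stated in the text, not the entries of Table 1A.

```python
def wherry(r2, n, p):
    """Assumed form of (2): 1 - (N-1)(1-R^2)/(N-p-1)."""
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)


def burket(r2, n, p):
    """Assumed form of (6): (2) squared, divided by R^2; zero when p > (N-1)R^2."""
    if p > (n - 1) * r2:
        return 0.0
    return wherry(r2, n, p) ** 2 / r2


def srinivasan(r2, n, p):
    """Assumed form of (7): 1 - [(N-1)/N][(N+p-1)/(N-p-1)](1-R^2)."""
    return 1.0 - (n - 1) / n * (n + p - 1) / (n - p - 1) * (1.0 - r2)


N = 36
models = {"dummy variables": (0.922, 7), "linear": (0.918, 3)}  # (R^2, p)

for name, (r2, p) in models.items():
    print(f"{name:16s} R2={r2:.3f}  "
          f"(6)={burket(r2, N, p):.3f}  (7)={srinivasan(r2, N, p):.3f}")
```

Under these assumed forms, both estimates come out higher for the linear model than for the dummy variables model, in line with the ordering reported in the text, even though the dummy variables model has the higher R².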

In a second step, the 36 observations were split randomly into two subsamples of 18 observations each. Each subsample was used alternately as estimation sample and as validation sample. The validation sample was used to compute a sample crossvalidated correlation. Moreover, (5), (6) and (7) were used to get estimates of the population crossvalidated correlation. The results are shown in Table 1B. When the first subsample is the estimation sample, the sample crossvalidated correlation and the estimates obtained with (5), (6) and (7) give an edge to the linear model. When the second subsample is the estimation sample, only the sample crossvalidated correlation and the estimate obtained with (5) give an edge to the linear model. However, the average of the two gives an edge to the linear model whichever criterion is used. Moreover, the results obtained in the first step are more precise. All the available information is taken into account at once, which leads to more precise estimates (as shown by simulation by Schmitt et al. (1977)).

The results also show that the estimates obtained with (5), (6) and (7) are quite close except in the case of the dummy variables model when only 18 observations are used for estimation (Table 1B). (When the second subsample is the estimation sample, the estimate obtained with (5) is .839 while it is .863 and .869 with (6) and (7) respectively.) In this case the number of parameters is 8 (including the intercept). Hence, the ratio N/(p + 1) is only 18/8 = 2.25. The estimates obtained with (5), (6) and (7) can differ substantially when this ratio is small.

TABLE 1

R-SQUARES AND SQUARED CROSSVALIDATED CORRELATIONS

SUMMARY

In consumer research it is often valuable to know the predictive validity of a regression model. An appropriate measure is the crossvalidated correlation. Estimators of the population crossvalidated correlation can be used. A few such estimators can be found in the psychology literature. They were reviewed. The advantage of these estimators (over a sample crossvalidated correlation) is that they produce more precise estimates. An example of the use of these estimators in consumer research was presented.

REFERENCES

M. W. Browne, "Predictive Validity of a Linear Regression Equation," British Journal of Mathematical and Statistical Psychology, 28(1975), 79-87.

G. R. Burket, "A Study of Reduced Rank Models for Multiple Prediction," Psychometric Monographs, (1964, No. 12).

P. Cattin, "On Formulas of Crossvalidated Multiple Correlation," Working Paper: Center for Research and Management Development, University of Connecticut, (1978).

R. B. Darlington, "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69 (1968), 161-182.

L. R. Goldberg, "Five Models of Clinical Judgment: An Empirical Comparison Between Linear and Nonlinear Representations of the Human Inference Process," Organizational Behavior and Human Performance, 6(1971), 458-479.

P. E. Green, "On the Analysis of Interactions in Marketing Research Data," Journal of Marketing Research, 10(1973), 410-420.

D. B. Montgomery and D. G. Morrison, "A Note on Adjusting R²," Journal of Finance, 28(1973), 1009-1013.

F. L. Schmidt, "The Relative Efficiency of Regression and Simple Unit Predictor Weights in Applied Differential Psychology," Unpublished Doctoral Dissertation. West Lafayette, Indiana: Purdue University, 1970.

N. Schmitt, B. W. Coyle, and J. Rauschenberger, "A Monte Carlo Evaluation of Three Formula Estimates of Cross-validated Multiple Correlation," Psychological Bulletin, 84(1977), 751-758.

J. E. Scott, and P. Wright, "Modeling an Organizational Buyer's Product Evaluation Strategy: Validity and Procedural Considerations," Journal of Marketing Research, 13(1976), 211-224.

V. Srinivasan, "A Theoretical Comparison of the Predictive Power of the Multiple Regression and Equal Weighting Procedures," Research Paper No. 347: Graduate School of Business, Stanford University, (1977).

R. J. Wherry, Sr., "A New Formula for Predicting the Shrinkage of the Coefficient of Multiple Correlation," Annals of Mathematical Statistics, 2(1931), 440-457.
