Advances in Consumer Research Volume 2, 1975 Pages 741-750
VALIDITY AND GOODNESS OF FIT IN DATA ANALYSIS
Donald R. Lehmann, Columbia University
[Donald R. Lehmann is Associate Professor at Columbia University Graduate School of Business.]
This paper describes the relationship between the truthfulness and usefulness of a model or hypothesized relationship among two or more variables and the goodness of fit which results from an empirical investigation of the model or hypothesis. Several common discrepancies between model "validity" and goodness of fit, arising from causes such as measurement problems, model mis-specification, and stochastic behavior, are described. In addition, a procedure for assessing the appropriateness of a linear regression model is presented.
The first step in formal research is often the specification of a model, either explicitly or implicitly. Data are then collected and the model rejected if the data and the model conflict. In the social sciences, this type of model investigation has been labeled predictive testing (Basmann, 1965). This approach consists of deducing conclusions from the model and seeing if the data support the conclusions.
The predictive testing approach, in spite of its obvious appeal, is not used widely in most research on consumer behavior. One reason for this is that in much research, a model is not extant prior to data analysis. Even when a model is present, it is often poorly specified. For example, A could be hypothesized to:
1) be related to B
2) be related to B in some specific mathematical way (e.g., A = c + dB)
3) be related to B in some specific mathematical way with constraints on the parameters (e.g., A = c + dB where c > 1 and 0 < d < 1)
Moreover, it is possible to specify the relationship between A and B to be causal or merely correlational and to assume a large random component exists. Hence lack of certainty over what the model is makes predictive testing very difficult.
Even when a model is specified, the predictive testing approach is often difficult to utilize. In the first place, it requires considerable ingenuity to deduce meaningful conclusions from models which can be falsified by data. This is especially true for models which do not assume causality. Secondly, in many cases several competing models exist, and with notable exceptions (Bass & Clarke, 1972), more than one of them can fail to be rejected. Hence because of difficulties with the predictive testing approach, as well as different training, most research on consumer behavior has tended to focus on goodness of fit measures as an indicator of useful relationships.
Goodness of fit measures have some obvious advantages in model development. First, they are quantitative indices which can be compared across models so that the best of a set of models can be selected. Secondly, there exist established statistical procedures for testing these measures for significance. Third, they can be used as a means of deducing new or modified models from a set of data. Fourth, they appear as output in canned computer programs, which is no small reason for their prominence in reported research results. Yet in spite of their advantages, goodness of fit measures have some important limitations as indicators of the truthfulness and usefulness of models.
LIMITATIONS OF GOODNESS OF FIT MEASURES
It is very easy to rely on goodness of fit measures as a means of evaluating model usefulness. Unfortunately, there are a variety of reasons why perfect goodness of fit measures should not be expected from useful models. These reasons include:
Imperfect Model Operationalization
One obvious situation where goodness of fit measures are not appropriate for estimating a model's usefulness is when the operationalization of the model does not match the model itself. In other words, the operationalization of the theoretical model may be dictated by the type of data available for examining it. Criticisms of research on this basis are widespread (Cohen, et al., 1972; Taylor & Gutman, 1973). On the other hand, a high goodness of fit for an imperfect operationalization suggests that a modified model, equally or even more valuable than the original, may be developed from the data.
Sample Limitations
The ability of goodness of fit measures to estimate a model's usefulness is obviously limited to the population represented by the sample. Hence, for example, if for some reason college students fit a model differently from the general population, studies of "40 college sophomores" are potentially misleading.
Measurement Problems
A variety of measurement problems occur which will affect the size of goodness of fit measures. For example, individuals may not treat an apparently equal-interval scale as such (Stevens, 1946). In many other situations, a lower order scale (e.g., ordinal) will be treated by the researcher as a higher order scale (e.g., interval or ratio). In situations where non-interval scales are used in regressions as though they were interval scales, the practical maximum for R2 even assuming a valid model is often far less than 1 (Lehmann, 1972; Morrison, 1973, 1972a, 1972b). On the other hand, questionnaires may induce consistent responses across questions. When such consistency bias is present, the goodness of fit measures among the affected variables are artificially inflated.
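Morrison's point about non-interval data can be made concrete with a short calculation. The sketch below (my own illustration, not taken from the original sources) computes the theoretical ceiling on R2 when a binary outcome is predicted by its own true probability: even a perfectly specified model cannot approach R2 = 1. The probability values used are purely hypothetical.

```python
# Ceiling on R2 when predicting a 0/1 outcome y from its true
# probability p.  Even the correct model explains only Var(p) of
# the total variance Var(y) = Var(p) + E[p(1-p)].
# The probability values below are purely illustrative.
ps = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

mean_p = sum(ps) / len(ps)
mean_p2 = sum(p * p for p in ps) / len(ps)

var_p = mean_p2 - mean_p ** 2   # variance explainable by any model of p
var_y = mean_p - mean_p ** 2    # total variance of the binary outcome

ceiling = var_p / var_y         # maximum attainable R2
print(round(ceiling, 3))        # about 0.267 -- far below 1
```

Since the ceiling depends only on how spread out the true probabilities are, no amount of modeling skill can raise the observed R2 past it.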
Minor Model Mis-Specification
It is possible for a model to be correctly specified in the sense of containing the correct constructs and links among constructs while incorrectly specifying the mathematical form of the relationships. In such cases, reliance on goodness of fit measures may lead to inappropriate conclusions.
For example, assume attitude is determined by a multi-attribute attitude model which is multiplicative with unequal weights on the attributes:

Aj = Bj1^W1 x Bj2^W2 x ... x Bjn^Wn

where Bji represents the distance from the desired amount of the ith attribute possessed by alternative j, Wi is the importance weight of the ith attribute, and Aj is the individual's attitude toward the jth alternative (since the Bji are distances, a smaller Aj indicates a more preferred alternative). In this case, application of the more popular linear model

Aj = W1Bj1 + W2Bj2 + ... + WnBjn

can lead to the conclusion that equal weights are superior. Assume two alternatives, M, which is one unit from the desired level on attribute 1 and four units from the desired level on attribute 2, and N, which is 4 and 2 units from the desired levels on the two attributes. According to the true model (multiplicative with unequal weights), M is the preferred alternative. If the linear model is applied, however, the assumption of the true weights suggests that M and N are equally preferred, while the use of equal weights correctly identifies M as the preferred alternative. Hence reliance on goodness of fit would falsely suggest that the importance weights for the two attributes were equal.
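The numerical pattern above can be verified directly. In the sketch below the importance weights W1 = 2 and W2 = 3 are hypothetical values chosen to reproduce the behavior described; any weights in the ratio 2:3 produce the same linear-model tie.

```python
# Distances from the desired attribute levels (smaller = better).
M = (1, 4)   # alternative M: 1 unit off on attribute 1, 4 on attribute 2
N = (4, 2)   # alternative N: 4 units off on attribute 1, 2 on attribute 2
W = (2, 3)   # hypothetical unequal importance weights

def multiplicative(b, w):
    # true model: Aj = product of Bji ** Wi
    out = 1
    for bi, wi in zip(b, w):
        out *= bi ** wi
    return out

def linear(b, w):
    # mis-specified model: Aj = sum of Wi * Bji
    return sum(wi * bi for bi, wi in zip(b, w))

print(multiplicative(M, W), multiplicative(N, W))  # 64 128 -> M preferred
print(linear(M, W), linear(N, W))                  # 14 14  -> tie under true weights
print(linear(M, (1, 1)), linear(N, (1, 1)))        # 5 6    -> equal weights pick M
```

The linear model with the true weights cannot distinguish the alternatives, while the "wrong" equal weights happen to rank them correctly, which is exactly how goodness of fit can mislead.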
A similar problem exists with comparing linear and non-linear forms of relationships. Given the degree of error or randomness in most survey research data (Hulbert & Lehmann, forthcoming), the property of a Taylor series guarantees that most of the relationship between two variables will be accounted for by the first (linear) term. Hence, unless orthogonal polynomials are employed, second and higher order terms will not appear to be very significant due to their high collinearity with the linear term. For this reason, goodness of fit measures tend to indicate that relations among constructs are essentially linear when in fact the relations may not be.
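The collinearity between a linear term and its square is easy to demonstrate. The sketch below (a hypothetical illustration using a 7-point scale) computes the Pearson correlation between x and x squared, and shows that centering, which is the essence of orthogonal polynomials for equally spaced values, removes the collinearity.

```python
import math

def corr(u, v):
    # Pearson correlation coefficient
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

x = [1, 2, 3, 4, 5, 6, 7]        # a hypothetical 7-point scale
x2 = [v ** 2 for v in x]
print(round(corr(x, x2), 3))     # 0.977: nearly collinear with the linear term

xc = [v - 4 for v in x]          # centered linear term
xc2 = [v ** 2 for v in xc]
print(round(corr(xc, xc2), 3))   # 0.0: orthogonal after centering
```

With a correlation near .98 between the raw linear and quadratic terms, almost all of the fit is absorbed by whichever term enters first, which is why the quadratic term rarely appears significant.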
Spurious Correlation
It is very possible that a model will produce a high goodness of fit while its variables are essentially unrelated causally. This can happen because of a spurious correlation between two variables (Blalock, 1964). Hence, two variables A and B can appear related because a third variable C affects both A and B. Unfortunately, the goodness of fit level gives no indication of whether or not a spurious relationship exists.
Inappropriate Application of the Model
In many cases a potentially valid model may be applied in an inappropriate way. For example, estimation of multi-attribute models across individuals by means of regression, though popular (Wilkie & Pessemier, 1973), seems an inappropriate application of an individual model (Beckwith & Lehmann, 1973). In such cases, the goodness of fit of the model does little to indicate its usefulness.
Post Hoc Model Development
In many situations, a model is essentially developed from data rather than tested by it. Hence the goodness of fit of such a model is biased upward. This is especially true when stepwise procedures are employed.
Degree of Freedom Limitations
Goodness of fit measures are obviously biased upward whenever many parameters are estimated in relation to the size of the sample. The obvious remedies to this problem include the reporting of adjusted R2's and the use of split-half procedures (Morrison, 1969). Unfortunately, split-half procedures require a relatively large sample size which, when available, tends to make the goodness of fit bias fairly small in the first place.
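The upward bias and the adjusted-R2 remedy can be made concrete. The sketch below uses the standard degrees-of-freedom adjustment formula; the sample sizes and number of predictors are hypothetical.

```python
def adjusted_r2(r2, n, k):
    # standard degrees-of-freedom adjustment:
    # penalizes R2 for the k parameters estimated from n observations
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A raw R2 of .50 from 20 observations and 10 predictors is worth
# less than nothing once degrees of freedom are charged against it.
print(round(adjusted_r2(0.50, n=20, k=10), 3))   # -0.056

# The same raw R2 from 200 observations barely changes.
print(round(adjusted_r2(0.50, n=200, k=10), 3))  # 0.474
```

The contrast between the two cases shows why the bias matters mainly in small samples, the very situation where split-half procedures are infeasible.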
Random Behavior
Many researchers have recently focused attention on the proposition that a considerable portion of individual behavior is random and hence in principle not explainable (Bass, 1974). This means that models with low goodness of fit indices may be both true and useful (Bass, et al., 1968). It also suggests that models with extremely high R2's may be modeling error or random behavior rather than the true underlying process.
THE APPROPRIATENESS OF R2
Probably no measure of goodness of fit receives more attention than the R2 in regression analysis. Most people feel that a good R2 is a big R2, largely due to the influence of economists who are used to dealing with aggregate time-series data. This section will proceed to argue that R2 is a poor indicator of whether a linear model is useful and appropriate.
Much of the research on consumer behavior has resulted in R2's in the .05 to .10 range. As such, it has indicated little individual-level predictive power but significant relationships among variables such as income and TV viewing time. These variables are usually measured on a 5-8 point scale. Hence it is not at all surprising that people with incomes between 5 and 10 thousand dollars vary considerably in the amount of time they spend watching TV. Figure 1 represents a typical situation in survey research.
FIGURE 1: HYPOTHETICAL REGRESSION EXAMPLE
At this point the question of exactly what model is being tested becomes important. If the model is that income is a deterministic predictor of TV viewing, then the model can be rejected. If, on the other hand, the model says TV viewing is related to income with a large random component, then the R2 becomes a poor indicator of the model's usefulness (Bass, et al., 1968).
Assuming the model is stochastic, the real question becomes whether average TV viewing differs across several income groups. This is essentially an analysis of variance problem. An interesting related question, however, is whether income and average TV viewing are linearly related. In order to address this question, it is useful to first partition the variance in the normal manner:
Total variance = Within variance + Between variance.
Next, by partitioning the Between variance we achieve the following situation (Lehmann, 1974):
Total variance = Within variance + Regression Explained variance + Regression Unexplained variance.
Recalling Figure 1, the within variance (which is generally by far the largest component of the variance) is essentially random and unexplainable by the independent variable. Since R2 is:
Regression Explained Variance/Total Variance, it is necessarily low. A more suitable measure of the appropriateness of a linear model would be:
pR2 = Regression Explained Variance/Between Variance
This pR2 would be 1 if the means of the segments fell on a straight line, and hence it measures how well a linear function estimates the effect of the independent variable on the dependent variable.
Two other points about pR2 are important to recognize. First, pR2 is the R2 which would result if the mean behavior for each segment were regressed against the value(s) for each segment on the independent variable(s), with the means weighted by the number of observations in each segment. Second, departures of pR2 from 1 can be tested to see if a significant non-linearity exists in the relationship. The null hypothesis of a linear relationship is tested as follows:
F = (Regression Unexplained Variance / df1) / (Within Variance / df2)

where df1 is the number of segments less the two regression parameters and df2 is the total sample size less the number of segments.
Derivations for these results are available (Lehmann, 1974).
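As a sketch of how pR2 might be computed (my own illustration of the partition above, using hypothetical data, not the paper's actual program): group the observations by the value of the discrete independent variable, partition the variance, fit a weighted least squares line through the group means, and form both R2 and pR2.

```python
def pr2(groups):
    """groups: dict mapping each x value to its list of y observations.
    Returns (r2, pr2) based on the partition
    Total = Within + Regression Explained + Regression Unexplained."""
    n = sum(len(ys) for ys in groups.values())
    grand = sum(sum(ys) for ys in groups.values()) / n

    within = sum(sum((y - sum(ys) / len(ys)) ** 2 for y in ys)
                 for ys in groups.values())
    between = sum(len(ys) * (sum(ys) / len(ys) - grand) ** 2
                  for ys in groups.values())

    # weighted least squares line through the group means
    xbar = sum(len(ys) * x for x, ys in groups.items()) / n
    sxx = sum(len(ys) * (x - xbar) ** 2 for x, ys in groups.items())
    sxy = sum(len(ys) * (x - xbar) * (sum(ys) / len(ys) - grand)
              for x, ys in groups.items())
    slope = sxy / sxx
    explained = slope ** 2 * sxx      # Regression Explained variance

    total = within + between
    return explained / total, explained / between

# Hypothetical data: group means 2, 4, 6, 8 lie exactly on a line,
# but each group has within-group scatter.
data = {1: [1, 3], 2: [3, 5], 3: [5, 7], 4: [7, 9]}
r2, p = pr2(data)
print(round(r2, 3), round(p, 3))   # 0.833 1.0 -- modest R2, yet perfectly linear means
```

The example makes the paper's point in miniature: the ordinary R2 is held down by the irreducible within-segment variance, while pR2 correctly reports that the segment means are exactly linear in x.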
AN EMPIRICAL EXAMPLE OF pR2
This example is based on a mail survey of 513 homemakers taken in 1968. The relationship between average TV viewing (measured on a 7-point scale) and income (measured on a 6-point scale) appears as Figure 2. From this figure, it appears that income and TV viewing are linearly related. A simple regression produces an R2 of only .048, however, which is neither very encouraging nor enlightening. In order to see if the relationship is really linear, pR2 was calculated (Table 1). The results indicate a pR2 of .878, which is close to 1.
FIGURE 2: TV VIEWING VS. INCOME
TABLE 1: TV VIEWING VS. INCOME ANOVA
The F test gives F = (18.871/4) / (2679.218/507) = 0.89, which is not significant, and hence the relationship, as hypothesized, is essentially linear.
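The F computation is simple arithmetic; the sketch below reproduces it from the quantities reported in Table 1.

```python
# F test for non-linearity: (Regression Unexplained Variance / df1)
# divided by (Within Variance / df2), using the reported ANOVA values.
regression_unexplained = 18.871   # df1 = 4 (6 income groups - 2 parameters)
within = 2679.218                 # df2 = 507 (513 observations - 6 groups)

f = (regression_unexplained / 4) / (within / 507)
print(round(f, 2))   # 0.89 -- well below any conventional critical value
```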
VALUE OF PR2
The value of the pR2 measure, therefore, is that it indicates whether a linear model appropriately expresses the relationship between a dependent variable and a set of independent variables. It does not, however, indicate the relative predictive value of independent variables. For example, a pR2 of .98 could result from variables which are not as useful in prediction as another set with a lower pR2. Hence pR2 merely indicates how well a linear and additive (non-interactive) model explains the effect of a set of discrete independent variables on a dependent variable.
A variety of reasons exist which make the correspondence between goodness of fit and model usefulness less than perfect. This does not mean that goodness of fit measures are of little value in evaluating or developing a model. It does mean, however, that care is required in using goodness of fit measures for model evaluation. This suggests:
1. Be more explicit about exactly what the model is. This would include reporting post hoc models as such.
2. Consider exactly how high a goodness of fit measure is reasonable given random behavior and measurement error. This means being suspicious of tautologies when R2's get above .6 in cross-sectional regressions of survey data.
3. Attempt whenever possible to utilize predictive testing to examine model validity.
4. Consider using special goodness of fit measures such as pR2 for examining the usefulness of stochastic models.
In summary, goodness of fit measures are useful tools in the hands of sophisticated researchers. A high goodness of fit, however, is neither a sufficient nor even a necessary condition for model usefulness.
REFERENCES
Basmann, R. L. On the application of the identifiability test statistic in predictive testing of explanatory economic models. The Econometric Annual of the Indian Economic Journal, 1965, 13, 387-423.
Bass, F. M. Unexplained variance in studies of consumer behavior. In J. U. Farley and J. A. Howard (Eds.), Controlling noise in marketing data. Forthcoming.
Bass, F. M. The theory of stochastic preference and brand switching. Journal of Marketing Research, 1974, 11, 1-20.
Bass, F. M. & Clarke, D. G. Testing distributed lag models of advertising effect. Journal of Marketing Research, 1972, 9, 298-308.
Bass, F. M., Tigert, D. J., & Lonsdale, R. T. Market segmentation: Group versus individual behavior. Journal of Marketing Research, 1968, 5, 264-270.
Beckwith, N. E. & Lehmann, D. R. The importance of differential weights in multiple attribute models of consumer attitude. Journal of Marketing Research, 1973, 10, 141-145.
Blalock, H. M. Causal inferences in nonexperimental research. New York: W. W. Norton, 1964.
Cohen, J. B., Fishbein, M., & Ahtola, O. T. The nature and uses of expectancy-value models in consumer attitude research. Journal of Marketing Research, 1972, 9, 456-460.
Hulbert, J. & Lehmann, D. R. Assessing the importance of the sources of error in structured survey data. In J. U. Farley & J. A. Howard (Eds.), Controlling noise in marketing data. Forthcoming.
Lehmann, D. R. An index for assessing the appropriateness of a linear regression model. Columbia University Graduate School of Business, working paper, 1974.
Lehmann, D. R. Preference among similar alternatives. Decision Sciences, 1972, 3, 64-82.
Morrison, D. G. Evaluating market segmentation studies: The properties of R2. Management Science, 1973, 11, 1213-1221.
Morrison, D. G. Regression with discrete random variables: the effect on R2. Journal of Marketing Research, 1972a, 9, 338-340.
Morrison, D. G. Upper bounds for correlations between binary outcomes and probabilistic predictions. Journal of the American Statistical Association, 1972b, 67, 68-70.
Morrison, D. G. On the interpretation of discriminant analysis. Journal of Marketing Research, 1969, 6, 156-163.
Stevens, S. S. On the theory of scales of measurement. Science, 1946, 103, 677-680.
Taylor, J. & Gutman, J. A reinterpretation of Farley and Ring's test of the Howard-Sheth model of buyer behavior. In S. Ward and P. Wright (Eds.), Advances in consumer research. Vol. 1. Urbana, Ill.: Association for Consumer Research, 1973. Pp. 438-446.
Wilkie, W. L. & Pessemier, E. A. Issues in marketing's use of multiattribute attitude models. Journal of Marketing Research, 1973, 10, 428-441.