# Some Validity and Reliability Issues in the Measurement of Attribute Utilities

[To cite]:

Philippe Cattin and Marc G. Weinberger (1980), "Some Validity and Reliability Issues in the Measurement of Attribute Utilities," in NA - Advances in Consumer Research Volume 07, eds. Jerry C. Olson, Ann Arbor, MI: Association for Consumer Research, Pages: 780-783.

[Direct URL]:

http://acrwebsite.org/volumes/9752/volumes/v07/NA-07

A number of measures that can be used for estimating the validity or reliability of individual attribute utilities are reviewed and discussed. What each of these measures actually estimates and the effect of the number of attributes and of the number of stimuli are discussed. Some results obtained with a data base are then presented.

INTRODUCTION

The estimation of attribute utilities is an important part of consumer research (e.g. Green and Srinivasan 1978) and it therefore follows that the validity and reliability of attribute utilities are important topics. Serving to emphasize the importance of investigating reliability and validity issues in marketing, the February 1979 issue of the __Journal of Marketing Research__ was devoted to a redefinition of and a call for a stronger measurement tradition in marketing.

The main purpose of this paper is to review and discuss several measures of the validity and reliability of attribute utilities, and to apply them to a data base. Only individual level utilities are considered. In consumer research this level of analysis is quite important since individual attribute utilities are often used as such, as in market simulations. Moreover, if individual attribute weights are more valid and reliable, the resulting aggregate weights will also be more valid and reliable. It should be noted that we are concerned only with validity and reliability measures which can be computed with the kind of conjoint analysis data that is collected in many studies. For instance, we do not deal with the prediction of actual choice decisions.

THE DATA

The data base used in this study was collected on three consecutive afternoons from 41 male college juniors and seniors. The product was a limited-line men's clothing store. All the selected subjects had had experience in evaluating and purchasing their own clothes. Moreover, they were a logical part of the target market of a men's clothing store in the small university town in which the study was executed. The subjects were told that they would be paid for their full three-day participation in the experiment.

The nine attributes used in the study (Figure 1) were derived from a literature review and the store image taxonomy developed by Lindquist (1974). The image categories that he found to have the most empirical and/or theoretical support from earlier researchers, and which fit the format of the current study, were included for investigation. Two realistic levels were identified for each attribute to provide a dichotomy for the male college students used as subjects.

[Figure 1: Retail Image Attributes and Their Levels]

The subjects were randomly assigned to one of four groups (Table 1) and remained in the same group throughout the three afternoons. The full-profile approach (Green and Srinivasan 1978, p. 107) was used the first two afternoons. Tradeoff matrices (Johnson 1974) as well as direct measurement (using a constant sum scale) were used the third afternoon.

The __first day__ the subjects were asked to rate, on a -3 to +3 seven-point scale, an orthogonal array of 16 or 32 store concepts defined along six or nine attributes, depending upon the group (see Table 1). The __second day__ they were asked to evaluate an orthogonal array (different from the first one) of 16 store concepts (defined along six or nine attributes). On both days each of the 16 or 32 store concepts was described on a separate file card based upon the dichotomous levels of the six or nine attributes. After examining each of the store concepts the subjects were asked to indicate on the seven-point -3 to +3 scale: "How likely is it that you would consider shopping in this store?" On the __third day__ the subjects were asked to fill out 2 x 2 tradeoff matrices consisting of all pairings of the six or nine attributes. In addition they were asked for their direct estimates of the importance of the six or nine attributes on a constant sum scale.

MEASURES OF VALIDITY AND RELIABILITY

Regression was used on the data collected the first or the second day to derive attribute utilities. The observations on the dependent variable were the -3 to +3 seven-point ratings, while 0-1 dummy variables (one per attribute) defined the independent variables. Other (nonmetric) estimation procedures, such as MONANOVA and LINMAP (Green and Srinivasan 1978, pp. 112-114), could be used. However, regression is appropriate because the dependent variable is a rating scale, and recent studies have shown that the predictive validity of regression compares well with that of nonmetric procedures (Green and Srinivasan 1978, pp. 113-114). In what follows we classify validity and reliability measures under the five following subheadings:

(a) percentage of variance explained in estimation sample,

(b) predictive validity of attribute utilities,

(c) reliability,

(d) intransitivity and interaction between attributes,

(e) convergent validity of attribute utilities with other measures (e.g. direct measures).

All these types of measurement were also included in Scott and Wright's study (1976), except (d).
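The dummy-variable regression just described can be sketched as follows. The design, utilities, and ratings below are hypothetical, chosen only to mimic the 0-1 coding of dichotomous attribute levels; the paper's actual arrays used 16 or 32 profiles on six or nine attributes.

```python
import numpy as np

# Hypothetical illustration: three dichotomous attributes in a full 2^3 design.
# Each column of X is a 0-1 dummy (1 = second level of that attribute).
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)],
             dtype=float)
design = np.column_stack([X, np.ones(len(X))])  # append a constant term

# Noise-free ratings generated from assumed utilities, kept within -3..+3.
true_utilities = np.array([2.0, 1.0, -1.5])
y = X @ true_utilities - 1.0

# OLS recovers the dummy-variable weights (the attribute utilities).
weights, *_ = np.linalg.lstsq(design, y, rcond=None)
print(weights[:3])  # estimated utilities of the three attributes
```

With real ratings the recovery would of course be noisy; the point is only that each estimated dummy weight is the utility increment of an attribute's second level over its first.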

Percentage of Variance Explained in Estimation Sample

Unexplained variance can come from two main sources: (a) noise (reliability), and (b) the model does not represent the process that produced the data (validity). The goodness-of-fit measure (R or R^{2} in regression) does not indicate the source of unexplained variance. In the studies reviewed by Scott and Wright, R^{2} is typically between .5 and .9. The amount of variance explained is likely to vary with the number of attributes and with the number of stimuli. It can be expected to decrease with the number of attributes (e.g. six to nine) because of "information overload" (too much information to take it all adequately into account), and with the number of stimuli (16 to 32) because of "task overload" (the more stimuli, the more noise due to boredom, fatigue, and so on). Scott and Wright did find a decrease in R^{2} ("information overload") from 2 to 3 and to 6 attributes. It should be noted that it is more appropriate to use the __adjusted R^{2}__ than the sample R^{2} because it corrects for degrees of freedom (and it provides an estimate of the variance explained in the whole population rather than in a sample).
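The degrees-of-freedom correction referred to above is the standard adjustment (the formula itself is not given in the paper): with n stimuli and p predictors, adjusted R^{2} = 1 - (1 - R^{2})(n - 1)/(n - p - 1). A minimal sketch:

```python
def adjusted_r2(r2, n, p):
    """Adjust a sample R^2 for degrees of freedom.

    n = number of stimuli (observations), p = number of dummy predictors.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Example: a sample R^2 of .80 from 16 stimuli and 9 predictors
# shrinks substantially once degrees of freedom are accounted for.
print(round(adjusted_r2(0.80, 16, 9), 3))  # → 0.5
```

The shrinkage is large here because 16 stimuli leave few degrees of freedom for 9 predictors, which is exactly the stimuli-to-parameters tradeoff discussed in the next subsection.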

Predictive Validity of Attribute Utilities

The predictive validity of attribute utilities can be estimated by computing a crossvalidated correlation between predicted and actual Y-values in a holdout sample. Such a correlation is typically lower than the sample or adjusted correlation. However, the "shrinkage" (from sample to crossvalidated correlation) was found to be relatively small in the studies that report it (Scott and Wright 1976, p. 212). The predictive validity should decrease with the number of attributes because of the "information overload" effect. The effect of the number of stimuli is not as obvious because not only the adjusted correlation but also the shrinkage from adjusted to crossvalidated correlation can be expected to decrease with the number of stimuli. As a result, it is difficult to predict how many stimuli will produce the best predictive validity. This is likely to vary with such things as the involvement of the respondents and the number of attributes and levels. However, the best predictive validity is likely to occur for a ratio Z of the number of stimuli to the number of attribute utilities of around 2 to 8, because the shrinkage decreases rapidly for small Z values and more and more slowly as Z increases (Cattin 1980), while the adjusted correlation should keep decreasing with Z because of the increasing "task overload" effect (due to boredom, fatigue, and so on).
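A crossvalidated correlation of this kind can be sketched as follows: fit the dummy regression on an estimation sample, predict the holdout ratings, and correlate predictions with the actual holdout values. The profiles, utilities, and noise level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 16 estimation and 16 holdout profiles on 6 dummy attributes.
def profiles(n, k):
    return rng.integers(0, 2, size=(n, k)).astype(float)

true_w = np.array([2.0, 1.5, 1.0, -1.0, 0.5, -0.5])  # assumed utilities
X_est, X_hold = profiles(16, 6), profiles(16, 6)
y_est = X_est @ true_w + rng.normal(0, 1, 16)    # noisy estimation ratings
y_hold = X_hold @ true_w + rng.normal(0, 1, 16)  # noisy holdout ratings

# Fit on the estimation sample only.
D = np.column_stack([X_est, np.ones(16)])
w, *_ = np.linalg.lstsq(D, y_est, rcond=None)

# Crossvalidated correlation: predicted vs. actual holdout ratings.
y_pred = np.column_stack([X_hold, np.ones(16)]) @ w
r_cv = np.corrcoef(y_pred, y_hold)[0, 1]
print(round(r_cv, 2))
```

Here Z = 16/7 ≈ 2.3, near the low end of the 2-to-8 range discussed above, so some shrinkage from the sample correlation is to be expected.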

Reliability

Test-retest reliability can be estimated by correlating the ratings or rankings obtained on the same stimuli at two different times. If there is no change in the process that produced the data, the unexplained variance is due only to noise. Both McCullough and Best (1979) and Acito (1977) obtained relatively high rank correlations with data on 27 or 18 stimuli. The median across respondents was in the .9 to 1.0 range.
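A rank correlation of the kind reported by McCullough and Best and by Acito can be computed by ranking each set of ratings and correlating the ranks (Spearman's rho). The test and retest ratings below are hypothetical, and the simple ranking used here assumes no tied ratings.

```python
import numpy as np

def spearman(a, b):
    """Rank correlation between two sets of ratings of the same stimuli.

    Ranks are obtained via double argsort; ties are broken arbitrarily,
    so this sketch assumes untied ratings.
    """
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical test and retest ratings (-3..+3) of six stimuli:
test_ratings = [3, 1, -2, 0, 2, -3]
retest_ratings = [2, 1, -3, 0, 3, -2]
print(round(spearman(test_ratings, retest_ratings), 3))  # → 0.886
```

A value near .9 would be in line with the medians those studies report.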

One can correlate estimated attribute weights instead of ratings or rankings. The attribute weights can be derived from two sets of ratings or rankings collected at the same time (__split-half reliability__) or at two different times (__test-retest reliability__). But then the resulting unexplained variance can be due either to noise or to the fact that the underlying model does not represent well the process that produced the data. If regression is the procedure used for estimating the attribute utilities, a transformation procedure should be applied to the regression weights, because different correlations would be obtained with changes in the sign of the attribute weights corresponding to one or more attributes. To eliminate this problem, the regression weights corresponding to the (n-1) levels of an attribute represented with dummy variables need to be transformed into n attribute utilities in the following fashion: a scalar k representing the utility of the nth level is added to each of the (n-1) regression weights so that the mean attribute utility (across the n levels) is zero, as in MONANOVA (Kruskal and Carmone 1969). Additionally, each regression weight corresponding to an interval-scaled attribute must be transformed into two attribute utilities: one identical to the regression weight, the other equal in absolute value but of opposite sign. A correlation obtained after transforming the sets of regression weights in this fashion is not affected by changes in the sign of the utilities corresponding to an attribute.
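The transformation just described can be sketched as follows; `to_part_worths` is a hypothetical helper name, not from the paper.

```python
def to_part_worths(dummy_weights):
    """Convert the (n-1) dummy weights of one attribute into n zero-mean utilities.

    The omitted (nth) level gets utility k = -sum(weights)/n, and k is also
    added to each of the (n-1) dummy weights, so the n utilities sum to zero.
    """
    n = len(dummy_weights) + 1
    k = -sum(dummy_weights) / n
    return [w + k for w in dummy_weights] + [k]

# For a dichotomous attribute, a single dummy weight b becomes (b/2, -b/2):
print(to_part_worths([1.6]))  # → [0.8, -0.8]
```

Centering each attribute's utilities at zero in this way makes the correlation between two sets of transformed weights invariant to which level of an attribute was coded as the dummy's "1".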

Reliability (whether test-retest or split-half, and whether it is computed on ratings, rankings, or attribute weights) is expected to decrease with the number of stimuli (up to a certain point only, because of the increasing "task overload" effect).

Intransitivity and Interaction Between Attributes

A nonmetric procedure such as Johnson's algorithm (1975) can be used to estimate attribute utilities with tradeoff matrix data. With enough data their predictive validity and reliability can then be estimated. But then, tradeoff matrix data can also be used to check how much intransitivity and/or interaction between attributes there appears to be (if any) for a given respondent (how this can be done is not discussed because of space limitations). Although such interactions or intransitivities may be due to noise, they seem more likely to be due to the actual process that produced the data. Hence, the more intransitivity and/or interaction, the less appropriate a compensatory multiattribute model, and the less valid the resulting attribute weights. It should be noted, however, that the respondents evaluate only two attributes at a time. This procedure thus provides a "test" of first-order interactions only. Nevertheless, it could be used at least to identify potential first-order interactions, which could then be included in a model or be used to redefine the attributes.

Convergent Validity

Subjects have been shown not to have much self-insight (Scott and Wright 1976, p. 214; Slovic, Fleissner and Bauman 1972). In particular, they tend to underestimate the weight of relatively important attributes and to overestimate the weight of less important attributes. Nevertheless, regression-estimated and direct measures of attribute weights should not be too dissimilar; otherwise, neither set of weights might have much validity. A correlation can be computed between sets of weights. But here again, the sets of weights must first be transformed using the procedure presented in the discussion on reliability. In this case the unexplained variance may be due to poor reliability and/or poor validity of the regression weights or of the direct measures. Such a correlation is expected to decrease with the number of attributes, and to increase up to a certain point with the number of stimuli. Moreover, the predictive validity of regression weights should be greater than the predictive validity of direct measures of weights for a sufficiently large Z ratio (because the shrinkage between sample and crossvalidated correlations decreases with Z).

RESULTS

Some preliminary results obtained with the data base described earlier are now presented.

Percentage of Variance Explained in Estimation Sample

A regression was run for each individual with the data collected the first day and with the data collected the second day. Table 2 shows the average adjusted correlations obtained for each cell of the design. The effects of task overload and information overload can be tested by a two-way ANOVA, using the results obtained with the data collected the first day (Table 2a). The results indicate that the information overload effect is not significant, but the task overload effect is significant at the 1% level (F=7.36; 1, 37 d.f.; p=.01). Moreover, the interaction between the two is significant, but not at the 1% level (F=5.26; 1, 37 d.f.; p=.033). However, there is no a priori reason to suspect such an interaction. But then, the average adjusted correlations (obtained by regression) on the data collected the second day also show an interaction-like pattern (Table 2b), even though in this case the subjects in each group had 16 stimuli to evaluate. A two-way ANOVA on these data shows that the main effects are not significant while the interaction is somewhat significant (F=3.66; 1, 37 d.f.; p=.063). Hence, it seems that on the average the fit is not as good (in the data provided by the respondents) in cells IB and IIA as in the two other cells, even though the assignment of respondents to cells was random.

Predictive Validity

Even though the task overload effect was found to be significant at the 1% level, the attribute weights obtained with 16 stimuli may or may not have more predictive validity than those obtained with 32 (because 32 stimuli produce more precise estimates than 16 for the same adjusted correlation). The attribute weights obtained with the data collected the first day were used to predict the ratings of the stimuli obtained the second day. The average crossvalidated correlations thus obtained are shown in Table 3. Here again a two-way ANOVA can be performed. The main effects are not significant (nor is the interaction). Hence, the task overload effect found on the adjusted correlations has been washed out. In fact, the average crossvalidated correlation in cells B (.770) is even slightly higher than in cells A (.732). Thus the regression estimates obtained with 32 stimuli are better predictors than those obtained with 16 stimuli, but not significantly so.

Reliability

Test-retest reliability could not be estimated on any ratings since the respondents never evaluated the same stimuli. But it could be estimated by correlating the attribute weights obtained from the data collected the first day with those obtained from the data collected the second day (after transforming the attribute weights using the procedure discussed earlier). The average Pearson correlations obtained for each cell of the design are shown in Table 4. As expected, they are greater with six attributes compared to nine ("information overload" effect), but the difference is not significant. The difference due to the number of stimuli is not significant either.

Interaction and Intransitivity Between Attributes

The tradeoff matrix data collected the third day show little or no interaction and little or no intransitivity in most of the 41 subjects. In fact, only four subjects show a few interactions between pairs of attributes. A non-interactive compensatory model seems questionable mostly for these four subjects.

Convergent Validity

Convergent validity was estimated by computing Pearson correlations between the direct attribute weight estimates obtained the third day and the regression estimates obtained from the data collected the first and the second day. The sign of the direct estimates was derived from the tradeoff matrix data. Moreover, all sets of weights were transformed using the procedure suggested earlier. The average correlations obtained for each cell of the design are shown in Table 5. As expected, they are greater with six attributes compared to nine ("information overload" effect). A two-way ANOVA shows that this effect is significant in Table 5a (F=7.84; 1, 37 d.f.; p=.008), but not in Table 5b. As pointed out earlier, respondents have been found to underestimate the importance of important attributes and to overestimate the importance of less important attributes (e.g. Slovic et al. 1972; Scott and Wright 1976, p. 214). To check this we transformed the regression estimates of each respondent so that the sum of their absolute values is 100 (which is also the sum of the direct measures provided by the respondents). We then compared the distributions of the absolute values of the importance weights across respondents. Some results are shown in Table 6. They confirm the results obtained by Slovic et al. and by Scott and Wright.
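The rescaling used to make the regression estimates comparable to the constant-sum direct measures can be sketched as follows; the weights below are hypothetical.

```python
def to_constant_sum(weights, total=100.0):
    """Rescale weights so their absolute values sum to `total`.

    This puts regression estimates on the same constant-sum scale (100)
    as the direct importance measures, while preserving their signs.
    """
    scale = total / sum(abs(w) for w in weights)
    return [w * scale for w in weights]

# Hypothetical regression estimates for four attributes:
w = to_constant_sum([2.0, 1.0, -0.5, 0.5])
print([round(x, 1) for x in w])  # → [50.0, 25.0, -12.5, 12.5]
```

Once both sets of weights share the same constant sum, their absolute-value distributions across respondents can be compared directly, as in Table 6.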

Finally, the direct measures of attribute importance were used to predict the ratings obtained the second day. The averages of the resulting Pearson correlations are shown in Table 7. A comparison of Tables 7 and 3 shows that the regression estimates do not have more predictive validity than the direct measures, especially in cells IA and IIA. In fact, the direct measures outperform the regression estimates 7 times out of 10 in cells IA and IIA, but only 4 times out of 11 in cell IB and 5 times out of 10 in IIB. But then, the regression estimates were obtained with the data collected the first day. A comparison of Tables 2 and 3 (cells IA and IB) indicates that the average adjusted correlations obtained by regression were lower on the data collected the first day than on the data collected the second day. In fact, the adjusted correlation (in cells IA and IB) is higher with the second day data 14 times out of 19 (with one tie), which is significant at the 5% level. In other words, there was a learning effect, and this is one reason why the results are not favorable to the regression estimates. To reduce this learning effect, we used the regression estimates obtained with the data collected the second day, as well as the direct estimates, to predict the scaled values collected the first day. A comparison of the resulting correlations shows that the averages are close in all cells, and not significantly different. Thus, even though respondents generally do not have much self-insight, the direct estimates of attribute importance provided by our respondents seem to have fair predictive validity.

[Figure: Distribution of Individual Attribute Utilities]

SUMMARY AND CONCLUDING COMMENTS

A major purpose of this paper was to review and discuss a number of measures that can be used for estimating the validity or reliability of individual attribute utilities. In the process, we discussed the effect of the number of attributes and of the number of stimuli on these measures, and used a data base to confirm or disconfirm our expectations. We also argued that most of these measures estimate either the amount of noise (due to boredom, fatigue, and so on), or how well the underlying model represents the process that produced the data, or both. Hence, there is some redundancy between these measures. Preliminary results obtained with the data base used in this study indicate that this tends to be true. Since it is not practical to use all the measures used in this study, what would be useful is a set of guidelines that specify the measures needed for estimating the components of validity and reliability.

REFERENCES

Acito, Franklin (1977), "An Investigation of Some Data Collection Issues in Conjoint Measurement," in __1977 Educators' Proceedings__, Chicago: American Marketing Association, 82-5.

Cattin, Philippe (1980), "A Note on the Estimation of the Squared Cross-Validated Multiple Correlation of a Regression Model," __Psychological Bulletin__ (in press).

Green, Paul E. and V. Srinivasan (1978), "Conjoint Analysis in Consumer Research: Issues and Outlook," __Journal of Consumer Research__, 5, 103-23.

Johnson, Richard M. (1974), "Tradeoff Analysis of Consumer Values," __Journal of Marketing Research__, 11, 121-27.

Johnson, Richard M. (1975), "A Simple Method for Pairwise Monotone Regression," __Psychometrika__, 40, 163-8.

Kruskal, Joseph B. and Frank J. Carmone (1969), "MONANOVA: A Fortran IV Program for Monotone Analysis of Variance," __Behavioral Science__, 14, 165-6.

Lindquist, J. D. (1974), "Meaning of Image," __Journal of Retailing__, 50, No. 4, 29-38.

McCullough, James and Roger Best (1979), "Conjoint Measurement: Temporal Stability and Structural Reliability,'' __Journal of Marketing Research__, 16, 26-31.

Scott, Jerome E. and Peter Wright (1976), "Modeling an Organizational Buyer's Product Evaluation Strategy: Validity and Procedural Considerations," __Journal of Marketing Research__, 13, 211-24.

Slovic, P., D. Fleissner and W. C. Bauman (1972) "Analyzing the Use of Information in Investment Decision Making: A Methodological Proposal," __Journal of Business__, 45, 283-301.
