Discussion Paper: Issues in Survey Measurement

Gordon G. Bechtel, University of Florida
[ to cite ]:
Gordon G. Bechtel (1979), "Discussion Paper: Issues in Survey Measurement", in NA - Advances in Consumer Research Volume 06, eds. William L. Wilkie, Ann Arbor, MI: Association for Consumer Research, Pages: 545-547.

Advances in Consumer Research Volume 6, 1979      Pages 545-547

INTRODUCTION

The papers in the present session all address biases in attribute ratings of attitudinal objects. However, each set of investigators has chosen a different object domain and a different bias for study. Best, Hawkins and Albaum look at ratings of university institutions as a function of the number of scale categories. Miller and Turner study the effects of perceived survey sponsorship upon the attribute ratings of banking service. Finally, Horowitz and Golob investigate shifts upon attribute scales attributable to unreliable survey respondents. Their attitudinal objects are transportation vehicles, both familiar and futuristic. After a consideration of each paper in turn, an attempt will be made to gain some broader perspectives, as well as directions for future research upon categorical survey responses.

THE EFFECT OF VARYING RESPONSE INTERVALS ON THE STABILITY OF FACTOR SOLUTIONS OF RATING SCALE DATA

Here the treatment conditions consist of two scale formats, one with five categories, and the other a continuous graphic scale (theoretically with an infinite number of categories). The authors study two types of effects, univariate and multivariate, which represent important issues in contemporary survey measurement.

Univariate Effects

Under each response format the treatment group rated three objects (e.g., the university bookstore) upon ten attributes (e.g., dependable). Figure 1 plots the 30 item means under the discrete format upon the corresponding means observed under the continuous format. Although the form is linear and the correlation is .92, we should note that this univariate equivalence is still rather modest, especially since these rating scale means represent aggregate measurement. For example, Jones (1960) used the Method of Successive Intervals in large-scale consumer surveys involving 20 food items. Two rating forms were employed, one containing nine categories and the other six, with labels representing varying degrees from "dislike" to "like". Each sample size was approximately 900, and the plot of the six category scale upon the nine category scale was almost a perfect straight line. Of course, we are left to wonder whether this result is due to Jones' superior sample size or to the superiority of the Method of Successive Intervals as an alternative to the more usual use of rating scale means.

Multivariate Effects

Whatever the degree of scale invariance in Figure 1, the authors' multivariate results appear to indicate much larger treatment effects. For example, the (varimax-rotated) factor loadings for one attitudinal object, the university, are plotted for the two treatment conditions in Figure 2. Here we see nothing like the correspondence observed for the univariate results in Figure 1.

However, the use of rotated factor loadings here could be clouding the issue of multivariate correspondence. Principal-components loadings would provide a safer comparison, due to their uniqueness in representing a correlation matrix; better yet, the most direct comparison possible would be given by the plot of one correlation matrix upon the other, i.e., the plot of the $\binom{10}{2} = 45$ correlations for the discrete rating condition upon the corresponding correlations observed for the continuous, graphic ratings. This type of direct comparison has been used by Andrews and Withey (1976) to assess the invariance of multivariate relationships in social-indicators research.
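The direct comparison suggested above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the toy ratings are hypothetical (they are not the Best, Hawkins and Albaum data), and only four attributes are used, giving 6 rather than 45 distinct correlations.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(ratings):
    """Return the k(k-1)/2 distinct inter-attribute correlations.

    `ratings` is a list of respondent rows, each holding k attribute ratings.
    """
    k = len(ratings[0])
    columns = [[row[j] for row in ratings] for j in range(k)]
    return {(i, j): pearson(columns[i], columns[j])
            for i, j in combinations(range(k), 2)}


def matrix_agreement(ratings_a, ratings_b):
    """Correlate the off-diagonal correlations of two rating conditions."""
    corr_a = pairwise_correlations(ratings_a)
    corr_b = pairwise_correlations(ratings_b)
    pairs = sorted(corr_a)
    return pearson([corr_a[p] for p in pairs], [corr_b[p] for p in pairs])


# Hypothetical toy ratings: 5 respondents by 4 attributes per condition.
discrete = [[1, 2, 3, 4], [2, 1, 4, 3], [3, 4, 1, 2], [4, 3, 2, 5], [5, 5, 5, 1]]
graphic = [[1.2, 2.5, 2.8, 4.1], [2.3, 1.1, 4.4, 2.7], [2.9, 3.6, 1.3, 2.2],
           [4.2, 3.1, 2.4, 4.8], [4.7, 4.9, 4.6, 1.4]]

for pair, r in sorted(pairwise_correlations(discrete).items()):
    print(pair, round(r, 3), round(pairwise_correlations(graphic)[pair], 3))
print("agreement of the two matrices:", round(matrix_agreement(discrete, graphic), 3))
```

Printing the paired correlations gives the coordinates of the suggested plot; the single agreement coefficient is merely one convenient summary of it.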

If these further probes should uphold the low level of association in Figure 2, then it would appear that the authors have put their finger on a particular univariate-multivariate differential in rating scale work. It is interesting to note that other multivariate effects in surveys have been cited by Turner and Krauss (1978), who indicate that "context and wording artifacts may disrupt the pattern of correlations between subjective indicators and other variables..." Of course, context and wording represent stimulus effects, while the number of rating categories studied here is on the response side.

THE EFFECTS OF SPONSORSHIP ON MAIL SURVEY RESPONSE AND EVALUATION BIAS

In the Miller and Turner paper we move over to the stimulus side in considering the contextual effect of perceived questionnaire sponsorship. This is a field experiment at the outset with three "sponsorship" groups being randomly drawn from the checking account file of a particular bank. Under each condition of perceived sponsorship the respondent rated 13 attributes of bank service, which is the only attitudinal object considered. However, prior to the analysis of the ratings, the authors evaluate the treatment effects upon response rate itself. These two types of dependent variables lend a very interesting quality to this study in that the first defines a true field experiment, while the second derives from a quasi-experiment carried out on the returned questionnaires in each treatment group.

The Field Experiment upon Response Rate

Although all three treatment groups had response rates below 50%, a significantly higher rate of return was observed for the group receiving a cover letter signed by a bank officer. The authors' conclusion that "commercial sponsorship has a favorable effect on the quantity of response" would seem to be too strong, however, on the basis of these results. When the sponsor is your own bank, asking questions about banking service, the higher response rate is not surprising, and it may not generalize to other survey settings. For example, it is an open question as to whether or not a General Mills food-preference survey would garner a higher response rate than an identical survey carried out by the USDA.

The Quasi-Experiment with Attribute Ratings

In the subsequent comparisons of the attribute ratings, the three treatment groups are no longer randomly equivalent. However, despite the less than 50% return in each group, a discriminant analytic check revealed that the three sub-groups of returnees were still demographically comparable. On this basis the authors proceed to Table 2, which reports the results of two stepwise discriminant analyses, each involving a comparison of two treatment groups. In these analyses only three of the 13 satisfaction items surface as discriminators.

Moreover, two of these three items wash out when the discriminant analyses are checked by means of a separate one-way analysis of variance for each item. Thus it would seem that perceived sponsorship had virtually no contextual effect upon rated attribute satisfaction.

This latter result could have been more compactly shown by substituting (for Table 2) a single, three-group discriminant analysis parallel to that carried out with the demographic variables. This would be attended, of course, by 13 corresponding analyses of variance, each involving three groups. Putting this technicality aside, however, the present application of the stepwise discriminant procedure, with the ANOVA check, represents an interesting approach to the selection of attribute-satisfaction items in survey research. This technique might be used in its own right to select attributes which discriminate between, say, known demographic segments.
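The item-by-item check described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the item names and ratings are hypothetical, and the critical value is supplied by the user from an F table rather than computed.

```python
def one_way_f(groups):
    """F ratio of a one-way analysis of variance.

    `groups` is a list of lists of ratings, one list per treatment group.
    Assumes at least some variation within groups (nonzero error term).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def screen_items(item_groups, critical_f):
    """Return {item: (F, exceeds_critical)} for every rated item."""
    results = {}
    for item, groups in item_groups.items():
        f_ratio = one_way_f(groups)
        results[item] = (f_ratio, f_ratio > critical_f)
    return results


# Hypothetical ratings of two items under three sponsorship conditions.
ratings = {
    "teller_courtesy": [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    "statement_clarity": [[2, 3, 4], [2, 3, 4], [2, 3, 4]],
}

# 5.14 is the tabled .05 critical value of F with 2 and 6 degrees of freedom.
for item, (f_ratio, flagged) in screen_items(ratings, critical_f=5.14).items():
    print(item, round(f_ratio, 2), "discriminates" if flagged else "no effect")
```

With 13 items and three groups, the same loop would yield the 13 three-group analyses of variance mentioned above.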

SURVEY DATA RELIABILITY EFFECTS ON RESULTS OF CONSUMER PREFERENCE ANALYSES

Thus far we have looked at the response format itself, as well as stimulus-context effects upon questionnaire responding. The paper by Horowitz and Golob, in contrast, spotlights the survey respondents themselves as potential sources of distortion. Their consideration of respondent unreliability as a measurable "trait" places this personal characteristic among other response dispositions, such as acquiescence and social desirability. Hence, this paper is also a contribution to the literature of personality assessment.

The Trait of Unreliability

Unreliability is first approached here through a sub-sample of 56 individuals (of the panel of 1,565), who were retested with 87 items. A test-retest correlation (over items) was calculated for each individual, and the distribution of these coefficients appears in Figure 1. This distribution is then used to dichotomize the sub-sample into "reliable" and "unreliable" respondents. Subsequent study of these protocols indicated that the unreliable respondent tends toward (1) a high usage of a particular response category, and (2) a high proportion of unanswered items. Thus, an individual's questionnaire can be scored for these two characteristics, just as it can be scored for other response dispositions. We should note, however, that the first characteristic is not an independently discovered property of unreliable respondents. That is, single category usage depresses intra-individual variation over items, which, in turn, depresses the test-retest coefficient used to classify the individual as unreliable. In any event, since each questionnaire can be scored in this way, the authors are able to identify unreliable respondents within the panel of 1,565 measured at the first time point only.
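The scoring just described, and the dependence between single-category usage and the test-retest coefficient, can be illustrated with a short sketch. The function names and the response protocols are hypothetical, and unanswered items are assumed to be coded as `None`; the authors' actual scoring rules are not reproduced here.

```python
from collections import Counter
from math import sqrt


def retest_correlation(first, second):
    """Correlation over items between two administrations for one respondent.

    Items unanswered on either occasion are skipped. Returns None when the
    coefficient is undefined (fewer than two items, or no variation).
    """
    pairs = [(a, b) for a, b in zip(first, second)
             if a is not None and b is not None]
    if len(pairs) < 2:
        return None
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / sqrt(sxx * syy)


def modal_category_share(responses):
    """Proportion of answered items falling in the most-used category."""
    answered = [r for r in responses if r is not None]
    if not answered:
        return 0.0
    return Counter(answered).most_common(1)[0][1] / len(answered)


def unanswered_share(responses):
    """Proportion of items left unanswered."""
    return sum(r is None for r in responses) / len(responses)


# Hypothetical respondent who uses category 3 almost exclusively.
first = [3, 3, 3, 3, 4, 3]
second = [3, 3, 4, 3, 3, 3]
print(modal_category_share(first), retest_correlation(first, second))
```

In this constructed case four of the six answers are repeated exactly and no answer moves by more than one category, yet the coefficient is negative, because there is almost no variation over items for it to work with.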

Application of Unreliability Scores

Once so identified, the unreliable respondents were eliminated from the initial panel, and the effects of this deletion upon the (aggregate) scale distributions were observed. First, when radically different, futuristic transportation concepts are rated, this elimination shifts the distribution outward toward the nearest end-point of the scale. This shift is attended by a decrease in the dispersion of the distribution, which, however, would appear to be an artifact associated with the truncated nature of the scale itself.

Next, when presently-owned vehicles are rated, opposite shifts occur. Finally, for an in-between concept, i.e., one which is rather similar to existing automobiles, the elimination of unreliable respondents leaves the distribution essentially unchanged. Thus, the authors demonstrate a systematic interaction between respondent unreliability and the degree of departure of product concepts from existing products. Also, since the response distributions appear, in the main, to be on the positive side of the various scales, unreliable respondents (i.e., older and less educated ones) would seem to be more favorable toward existing vehicles and less favorable toward radically different ones.

In a final assessment of the effect of unreliability, Horowitz and Golob relate ownership preferences among transportation concepts to satisfaction with fuel economy and vehicle size, which were found to be "representative" attributes. A multinomial logit model (the details of which are not given) was used to estimate attribute weights, which were essentially unaltered by the elimination of unreliable respondents. Of course, parameter robustness to unreliability must be established for each model and attitudinal domain separately, but the present paper outlines the method for doing this with survey samples. The very availability of this procedure, however, raises the question of what to do if the model is affected by unreliability. That is, which set of parameter estimates should we keep: those for the whole sample, which are more representative, or those for the subsample, which are more reliable? Since this is much broader than a modeling issue, we will return to it below.

FUTURE DIRECTIONS

Looking back upon these three papers, as well as rating-scale studies in general, it is clear that the distinction between aggregate and individual measurement is crucial. In the aggregate case Horowitz and Golob found no differences between test-retest averages, but their cross-sectional correlation for each test-retest pair fell well below .50! This different order of magnitude between aggregate and disaggregate relationships is commonly seen in survey research. For example, in the area of consumer optimism, aggregate time-series regressions display multiple correlations in the .90's, while their cross-sectional counterparts fall into the .40's (Strumpel, Morgan, and Zahn, 1972). In the USDA's consumer satisfaction surveys the aggregate regression of food satisfaction upon attribute satisfaction shows an $R^2$ of .98. This coefficient falls to .64 in the corresponding cross-sectional regression (Bechtel, 1978).
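The gap between aggregate and cross-sectional coefficients is easy to reproduce with constructed numbers. The sketch below uses purely hypothetical data (not the figures from the studies cited): five occasions, four respondents per occasion, and individual deviations chosen to be uncorrelated so that they cancel in the occasion means.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical individual deviations: zero mean and mutually uncorrelated.
x_noise = [-2, -2, 2, 2]
y_noise = [-2, 2, -2, 2]

# Each occasion has a common level; every respondent adds a personal deviation.
occasions = range(5)
x_by_occasion = [[level + e for e in x_noise] for level in occasions]
y_by_occasion = [[level + f for f in y_noise] for level in occasions]

# Aggregate relationship: correlate the occasion means.
x_means = [sum(xs) / len(xs) for xs in x_by_occasion]
y_means = [sum(ys) / len(ys) for ys in y_by_occasion]
aggregate_r = pearson(x_means, y_means)

# Disaggregate relationship: correlate the pooled individual responses.
x_pooled = [x for xs in x_by_occasion for x in xs]
y_pooled = [y for ys in y_by_occasion for y in ys]
individual_r = pearson(x_pooled, y_pooled)

print(round(aggregate_r, 3), round(individual_r, 3))
```

Here the occasion means are perfectly correlated while the pooled individual correlation is one third, since averaging removes the individual deviations that dominate the disaggregate variance.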

The message here is clear and well-known, i.e., aggregate measures are more stable over time and also lend themselves to higher degrees of prediction and/or explanation. However, this statistical advantage of aggregate measurement at the societal level does not solve our measurement problems at the individual level. Even so, the numerical rating scale, with its assignment of integers to ordered categories, continues to enjoy almost ubiquitous usage. These numerical assignments involve scale truncation, as well as the (perhaps unwarranted) assumption of equal perceptual steps between successive categories. In a sense, this is measurement by fiat, since there is no way of assessing the fit of the measurement procedure itself, i.e., of evaluating it.

It is possible that stochastic response models, with parameters estimated by maximum likelihood methods, can alleviate these problems at the individual level of questionnaire measurement (cf. Cox, 1970). In the aggregate case, however, clear alternatives to numerical rating methods are already available. Here there is a natural grouping of questionnaire responses such that we are able to observe the proportion of respondents falling into each scale category. These proportions circumvent the assignment of integers to categories by enabling us to substitute a probabilistic response model for the usual rating method. This model can be parameterized to represent object or attribute values, as well as the reference points upon the rating continuum. Therefore, estimates of object or attribute parameters now replace the mean ratings usually calculated.

A principal advantage of model-based measurement lies in the fact that the measurement procedure itself is vulnerable to a goodness-of-fit test, which assesses the scalability, and therefore the quality, of the data itself. This approach, known in psychometrics as the Method of Successive Intervals, is described by Torgerson (1958), and its application to food-preference surveys by Jones (1960) has been noted above. Also, a generalization of successive-intervals scaling in the form of a logistic response model has recently been presented (Bechtel, in press). This generalization permits us to study contextual effects, such as perceived sponsorship, upon scale values and upon scalability itself under several experimental conditions.
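The logic of scaling from category proportions can be sketched as follows. This is a simplified, equal-dispersion version with a least-squares solution, offered only as an illustration; it is not claimed to be the exact procedure of Torgerson or Jones, the object names and frequencies are hypothetical, and the discrepancy index is a descriptive substitute for a formal goodness-of-fit test.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def boundary_deviates(freq):
    """Normal deviates of the cumulative proportions below each boundary.

    Assumes the lowest and highest categories are both non-empty, so that
    every cumulative proportion lies strictly between 0 and 1.
    """
    total = sum(freq)
    cumulative = 0
    deviates = []
    for f in freq[:-1]:
        cumulative += f
        deviates.append(STANDARD_NORMAL.inv_cdf(cumulative / total))
    return deviates


def successive_intervals(counts):
    """Estimate scale values and category boundaries from frequencies.

    `counts` maps each object to its category frequencies, ordered from the
    lowest to the highest category. Fits z[j][k] = bound[k] - value[j] by
    least squares, with the mean scale value fixed at zero.
    """
    z = {obj: boundary_deviates(freq) for obj, freq in counts.items()}
    n_bounds = len(next(iter(z.values())))
    bounds = [sum(row[k] for row in z.values()) / len(z) for k in range(n_bounds)]
    mean_bound = sum(bounds) / n_bounds
    values = {obj: mean_bound - sum(row) / n_bounds for obj, row in z.items()}
    return values, bounds


def mean_abs_discrepancy(counts, values, bounds):
    """Average gap between observed and fitted cumulative proportions."""
    gaps = []
    for obj, freq in counts.items():
        total = sum(freq)
        cumulative = 0
        for k, f in enumerate(freq[:-1]):
            cumulative += f
            fitted = STANDARD_NORMAL.cdf(bounds[k] - values[obj])
            gaps.append(abs(cumulative / total - fitted))
    return sum(gaps) / len(gaps)


# Hypothetical frequencies over five ordered categories for two objects.
counts = {"item_a": [10, 20, 40, 20, 10], "item_b": [5, 10, 25, 35, 25]}
values, bounds = successive_intervals(counts)
print(values, bounds, mean_abs_discrepancy(counts, values, bounds))
```

The estimated scale values take the place of mean ratings, the estimated boundaries take the place of the integers assigned by fiat, and a large discrepancy would signal that the data do not scale under the assumed model.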

Finally, whether stochastic response models or more standard methods are used for survey measurement, we will continue to have less control over response dispositions than we have over stimulus biases. Obviously, questionnaires can be pilot tested and corrected for stimulus effects such as context or wording, but respondent unreliability is out of our hands. The tactic of throwing out unreliable survey respondents seems very severe, especially since they are older and less well educated. This demographic deletion aggravates the non-response problem, already serious in survey research (ISR Newsletter, 1976), by further warping carefully drawn probability samples. Of course, the problem of sample representativeness involves the ultimate application of present research findings, which is somewhat beyond the scope of this session. The paper by Horowitz and Golob has indeed contributed to our thinking about response error by placing unreliability among other potentially biasing response dispositions, such as acquiescence and social desirability. The continuous nature of all of these traits, though, points up the further problem of establishing arbitrary cut-offs in purifying survey data. Whether the benefit of this purification will outweigh the cost in representativeness is an important matter for future inquiry.

REFERENCES

Frank M. Andrews and Stephen B. Withey, Social Indicators of Well-Being (New York: Plenum Press, 1976).

Gordon G. Bechtel, "Consumer Satisfaction with Foods and Their Attributes," Working Paper No. 6, Center for Consumer Research, College of Business Administration, University of Florida, 1978.

Gordon G. Bechtel, "A Scaling Model for Survey Monitoring," Evaluation Quarterly, in press.

D. R. Cox, Analysis of Binary Data (London: Methuen, 1970).

ISR Newsletter, "The Growing Interest in Telephone Interviewing," Institute for Social Research, The University of Michigan, 4 (Autumn, 1976), 2.

Lyle V. Jones, "Some Invariant Findings under the Method of Successive Intervals," in H. Gulliksen and S. Messick (eds.), Psychological Scaling: Theory and Applications (New York: John Wiley and Sons, 1960).

B. Strumpel, J. N. Morgan, and E. Zahn (eds.), Human Behavior in Economic Affairs (Amsterdam: Elsevier, 1972).

Warren S. Torgerson, Theory and Methods of Scaling (New York: John Wiley and Sons, 1958).

Charles F. Turner and Elissa Krauss, "Fallible Indicators of the Subjective State of the Nation," American Psychologist, 33 (May, 1978), 456-470.
