Discussion Paper: Issues in Survey Measurement

Paul E. Green, University of Pennsylvania
[To cite]: Paul E. Green (1979), "Discussion Paper: Issues in Survey Measurement," in Advances in Consumer Research Volume 6, ed. William L. Wilkie, Ann Arbor, MI: Association for Consumer Research, 548-549.





One of the common threads of the three papers comprising this session is that they all employ rating scales as a major aspect of the study. The historically much maligned rating scale has managed not only to survive its many critics but, indeed, to flourish and proliferate with the intensity of unbottled fruit flies. Such resilience and ubiquity should not go unrewarded, as these papers attest.


The Best, Hawkins, and Albaum paper examines the relationship between two types of rating scale intervals--a five-point scale and a continuous scale--from both a univariate and a multivariate standpoint. The univariate results are fully in accord with expectations. The authors find very high correspondence between scale means; the Kolmogorov-Smirnov test for correspondence of distributions reinforces this finding.
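The mechanics of such a univariate comparison can be sketched as follows. This is an illustrative simulation only--the latent scores, the 0-100 continuous format, and the category cut points are all assumptions, not the authors' data--showing how the two scale formats can be standardized and then compared by the two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical latent attitudes for one concept (stand-ins, not the authors' data).
latent = rng.normal(loc=0.0, scale=1.0, size=200)

# Continuous scale: map the latent score onto a 0-100 line.
continuous = 50 + 15 * latent

# Five-point scale: bin the same latent scores into categories 1..5
# at assumed cut points.
five_point = np.digitize(latent, bins=[-1.5, -0.5, 0.5, 1.5]) + 1

# Put both formats on a common footing by standardizing each.
z_cont = (continuous - continuous.mean()) / continuous.std()
z_five = (five_point - five_point.mean()) / five_point.std()

# Two-sample Kolmogorov-Smirnov test for correspondence of distributions.
ks_stat, p_value = stats.ks_2samp(z_cont, z_five)
print(ks_stat, p_value)
```

A small statistic (and a nonsignificant p-value) would correspond to the authors' finding of high univariate agreement between the two formats.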

In contrast, the multivariate portion of the study--entailing separate factor analyses of the two types of rating scales and subsequent correlations of factor loadings between scale types--showed poor results. However, before concluding that factor analysis is all that sensitive to the numbers-of-intervals characteristic, a few comments might be raised regarding the authors' methodology:

1. Assuming that the authors applied principal components analysis to the correlation matrix (followed by Varimax rotation), why did they not correlate the (off-diagonal) correlations between the two types of scales? This would provide a direct summary of how closely related the two types of scales were and, of course, could be done for each of the three concepts separately.

2. The factor loadings summaries of Table 2 do not show particularly clean patterns for each scale type separately. One wonders whether the "right" number of factors was extracted in the first place. Did the authors check the between-scale correspondences for, say, the 2-factor and 4-factor solutions?

3. Perhaps the analysis summarized in Table 2 should have been augmented by a factor matching procedure. Assuming that the authors would wish to retain orthogonality of factors, they could use either Cliff's procedure (Cliff, 1966) or the closely related Schonemann and Carroll factor matching approach (Schonemann and Carroll, 1970).
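Comments 1 and 3 above can both be made concrete in a short sketch. The data below are simulated stand-ins (the item counts, noise levels, and loading matrices are assumptions, not the papers' data): first the off-diagonal correlations of the two scale types are correlated directly, then two orthogonal loading matrices are matched by an orthogonal Procrustes rotation--the scipy analogue of the Cliff (1966) / Schonemann and Carroll (1970) procedures--before their loadings are compared:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Hypothetical respondent-by-item data for the two scale formats.
n_items, n_resp = 10, 150
latent = rng.normal(size=(n_resp, n_items))
continuous = latent + 0.2 * rng.normal(size=latent.shape)
five_point = np.digitize(latent, [-1.5, -0.5, 0.5, 1.5]).astype(float)

# Comment 1: correlate the off-diagonal correlations directly.
r_cont = np.corrcoef(continuous, rowvar=False)
r_five = np.corrcoef(five_point, rowvar=False)
iu = np.triu_indices(n_items, k=1)
r_of_rs, _ = pearsonr(r_cont[iu], r_five[iu])

# Comment 3: match two (hypothetical) orthogonal loading matrices by an
# orthogonal Procrustes rotation before correlating their loadings.
load_a = rng.normal(size=(n_items, 2))           # stand-in loadings, scale A
rot = np.linalg.qr(rng.normal(size=(2, 2)))[0]   # arbitrary orthogonal rotation
load_b = load_a @ rot + 0.05 * rng.normal(size=load_a.shape)
R, _ = orthogonal_procrustes(load_b, load_a)
matched = load_b @ R
fit, _ = pearsonr(matched.ravel(), load_a.ravel())
print(round(r_of_rs, 2), round(fit, 2))
```

The point of the Procrustes step is that loadings from separate factorings are identified only up to rotation; comparing them without matching can understate their true congruence.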

In summary, there are some unsettled questions about the lack of congruence between the two types of rating scales that might be related to the factor analyses themselves. If the preceding suggestions are implemented and the results still support the authors' earlier conclusions, then additional study of the problem is clearly indicated. Perhaps this additional study could include Monte Carlo simulations as well as the type of empirical comparisons reported here. Finally, from a behavioral viewpoint, we might ask: why would scale interval type be expected to affect covariances, but not the means or variances?


The paper by Miller and Turner also employs rating scales as a central feature of the analysis. The authors have turned out an attractive field experiment dealing with the question of whether type of sponsorship affects mail survey response rate, the demographic composition of those who do respond, and rating scale response patterns.

For the most part the study was nicely designed and executed. However, a few critical comments--hardly more than nit-picks--might be offered as a space-filling effort:

1. Were such background variables as number of years with current bank, number of different bank services used, or average size of checking account balance included in the survey? If so, perhaps this type of variable might show greater sensitivity to survey sponsorship than the more traditional demographics.

2. In the Table 1 analysis, one might wish to conduct a one-way ANOVA, followed by multiple comparison tests (assuming that the overall ANOVA results are significant).

3. Why did the authors not run a three-group discriminant analysis--similar to what was done in the case of the demographics--in examining the significance of mean differences across rating scales?

4. One wonders if the Table 2 findings are useful, given no significant univariate F values. At the very least, additional discriminant runs should be made to see if the two variables are significant together and, if so, whether both are needed. Model comparison tests could be used for this purpose (Rao, 1952).
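The analysis suggested in comment 2 follows a standard two-stage pattern. The sketch below uses invented group names, means, and sample sizes (none of which come from the Miller and Turner study): an overall one-way ANOVA across the three sponsorship groups, followed--only if the overall test is significant--by pairwise comparisons with a Bonferroni correction standing in for a formal multiple comparison procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical rating-scale scores under three sponsor conditions
# (group labels, means, and n's are assumptions, not the authors' data).
university = rng.normal(5.0, 1.0, 60)
bank       = rng.normal(5.4, 1.0, 60)
anonymous  = rng.normal(5.0, 1.0, 60)
groups = {"university": university, "bank": bank, "anonymous": anonymous}

# Stage 1: overall one-way ANOVA across the three sponsorship groups.
f_stat, p_overall = stats.f_oneway(university, bank, anonymous)

# Stage 2: pairwise t tests, Bonferroni-corrected for three comparisons,
# carried out only if the overall test is significant.
names = list(groups)
if p_overall < 0.05:
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            t, p = stats.ttest_ind(groups[names[i]], groups[names[j]])
            print(names[i], names[j], round(min(p * 3, 1.0), 3))
```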

However, these are quite minor suggestions that do not affect the substantive conclusions of the paper. The study deals with a set of interesting issues in survey research. One naturally wonders how well the results on response bias (actually, the lack thereof) generalize to other contexts and respondent populations.


The paper by Horowitz and Golob addresses an important problem in the use of rating scales, namely their test-retest reliability. The authors' study is quite competently carried out, and my critical comments are few in number and minor in import. However, some points come to mind which the authors might wish to consider:

1. How did the authors arrive at a sample of 56 respondents for test/retest purposes? Perhaps two independent samples could have been drawn for purposes of assignment rule validation. If this were done, the assignment rule portrayed in Table 1 could have been checked against fresh data before going to the main sample.

2. What appears to be missing is information on which of the 29 scales show the greatest incidence of missing data, which sets of scales show the highest (relative) incidence of unvarying response categories, or whether the concepts themselves differ in terms of the incidence of these putative sources of unreliability. In other words, while we may know that missing data and low variance in response category selection are associated with low reliability, how can we design respondent tasks to reduce the tendency for these bad things to occur?

3. When one returns to the total sample, it would also seem appropriate to split that sample into halves and carry out some type of cross validation of the new assignment rules adopted in Table 2.

4. Incidentally, what is the rationale for the four types of rules in Table 2? Also, what is the precise nature of the significance tests being carried out? Do they involve the subsample mean versus the rest of the sample (for each subsample mean, in turn), or what?

5. Insofar as Hypothesis I is concerned, how big are the effects? Clearly, with a sample size this large the results could be statistically significant but not operationally important.

6. Hypothesis II makes me a bit uncomfortable. The study shows that unreliability and age/education are associated. But how about reliable older, poorly educated respondents? Do they also show the same stereotype evinced by unreliable older, poorly educated respondents?

7. As a matter of interest, why all the concern about Hypotheses I, II, and III in the first place? It seems to me that if reliability (as defined in this study) is really important, it is neither appropriate nor useful to average over the responses of reliable and unreliable respondents. A more sensible thing, of course, is to analyze both sets of data separately. I assume that this is what most researchers do currently when they have reason to suspect different reliability levels across respondents and assignment rules are available for classifying respondents according to their degree of reliability. (Degree of reliability may simply reduce to dichotomous assignments in some cases.)

8. Unfortunately, it is not clear to me how the multinomial logit model was applied in this example; perhaps the authors could spend some time in showing us how they developed the predictor-variable set and how the criterion variable was defined. However, it is not surprising that the type of unreliability found in this study did not affect the logit model results.

9. On a more general basis it has been known for a long time that random error added to regression-like clinical judgment models does not affect the relative sizes of the partial regression coefficients. Similarly, in MDS programs, such as Carroll and Chang's INDSCAL, the addition of random subjects has little effect on the group stimulus space, even though the overall fit is (obviously) reduced.
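The robustness claim in comment 9 is easy to verify directly. In the sketch below (all numbers are illustrative assumptions), a substantial amount of random error is added to the criterion of a regression-like judgment model; the fitted partial coefficients change very little in relative size, even though overall fit is degraded:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical predictors and a clinical-judgment style criterion.
X = rng.normal(size=(500, 3))
true_beta = np.array([2.0, 1.0, 0.5])
y = X @ true_beta

# Add substantial random error to the criterion, as unreliable
# respondents would, and refit by least squares.
y_noisy = y + rng.normal(scale=2.0, size=y.shape)

beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_noisy, *_ = np.linalg.lstsq(X, y_noisy, rcond=None)

# Relative sizes of the partial coefficients are nearly unchanged.
ratio_clean = beta_clean / beta_clean[0]
ratio_noisy = beta_noisy / beta_noisy[0]
print(np.round(ratio_clean, 2), np.round(ratio_noisy, 2))
```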

All in all, this is an interesting study that deserves follow up by other researchers. In particular, I would like to see what other clues to unreliability might be found in other survey research and what background correlates (comparable to age and education in the current study) are noted.


When all three papers are considered together, what lessons can be learned? Bearing in mind that the papers were prepared independently and the session theme is a broad one, it still seems to me that a few suggestions for further research can be made.

For example, both the Best, Hawkins, and Albaum study and the Horowitz and Golob research might benefit from comparison studies that utilize Monte Carlo simulation. Clearly, the Monte Carlo technique could be used to examine questions related to number of scale intervals and other kinds of scale transformations on factor loadings matrices and on "true" versus spurious factors. Moreover, Horowitz and Golob could use Monte Carlo methods to see what kinds of unreliability affect the parameter estimates derived from such decomposition models as the multinomial logit, and how the estimates are affected.
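One replication of the kind of Monte Carlo study suggested here might look as follows. Everything in the sketch is an assumption made for illustration--the two-factor population loading pattern, the item and sample counts, and the five-point cut points are not taken from either paper. The same simulated responses are factored in continuous and five-point form, and the two loading matrices are compared element by element; repeating this over many replications (and over different cut points and numbers of intervals) would map out how interval coarseness affects recovered loadings:

```python
import numpy as np

rng = np.random.default_rng(4)

def pc_loadings(data, k=2):
    """Loadings of the first k principal components of the correlation
    matrix, with each column's sign fixed so its sum is positive."""
    r = np.corrcoef(data, rowvar=False)
    vals, vecs = np.linalg.eigh(r)
    order = np.argsort(vals)[::-1][:k]
    load = vecs[:, order] * np.sqrt(vals[order])
    return load * np.sign(load.sum(axis=0))

# Assumed two-factor population: items 1-4 defined by factor 1,
# items 5-8 by factor 2 (illustrative values only).
n_resp, n_items = 300, 8
pop_load = np.zeros((n_items, 2))
pop_load[:4, 0] = 0.9
pop_load[4:, 1] = 0.6
scores = rng.normal(size=(n_resp, 2))
continuous = scores @ pop_load.T + rng.normal(scale=0.6, size=(n_resp, n_items))

# Coarsen the identical responses onto a five-point scale.
five_point = np.digitize(continuous, [-1.5, -0.5, 0.5, 1.5]).astype(float)

# Element-by-element agreement of the two loading matrices; looping this
# over many replications gives the Monte Carlo comparison.
agreement = np.corrcoef(pc_loadings(continuous).ravel(),
                        pc_loadings(five_point).ravel())[0, 1]
print(round(agreement, 2))
```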

A second area, of potential relevance to all three papers, involves the design of appropriate rating scales and perceptual/evaluative tasks in the first place. Are people more comfortable using one type of rating scale than another? Can various devices be used to reduce the incidence of missing data and same-category checking? Should rating scale design vary with the type of sponsor identified in the study? Of course, the list is virtually endless since so few general principles have been adduced to date, despite the fact that rating scales are almost de rigueur in any self-respecting survey that goes out these days.

Not to end on an unduly heretical note, but perhaps respondents are being asked to rate just too many things on too many scales, where the scales are largely redundant to begin with or are too vague and nonoperational to be useful, once the results are in. Certainly, those working in psychographics have had cause for concern regarding the ability of clusters based on life style rating patterns to predict other aspects of the consumer, such as brand choice. More importantly, can attribute rating scales--no matter how cleverly designed and executed--provide rich enough response data to enable us to design an "optimal" university book store, a consumer banking service, or a new automobile? My own experience with attribute ratings suggests that this approach to product/service design leaves much to be desired.

Clearly, rating scales for operationalizing perceptions of choice objects have a long history in expectancy-value modeling and the like, and it is not likely that this approach will soon be replaced by conjoint techniques (Green and DeSarbo, 1978), or other such contenders. Still, one wonders if other ways of eliciting perceptions can be found that provide a sounder basis for product/service design and better predictions of preferences for new objects. At least it might be fun to look.


Norman Cliff, "Orthogonal Rotation to Congruence," Psychometrika, 31(1966), 33-42.

Paul E. Green and Wayne S. DeSarbo, "Additive Decomposition of Perceptions Data via Conjoint Analysis," Journal of Consumer Research, 5(June, 1978), 58-65.

C. R. Rao, Advanced Statistical Methods in Biometric Research (New York: John Wiley and Sons, 1952).

Peter H. Schonemann and Robert M. Carroll, "Fitting One Matrix to Another Under Choice of a Central Dilation and a Rigid Motion," Psychometrika, 35(1970), 245-256.