Survey Data Reliability Effects on Results of Consumer Preference Analyses

Abraham D. Horowitz, General Motors Research Laboratories
Thomas F. Golob, General Motors Research Laboratories
ABSTRACT - Mail panel survey respondents were resurveyed to assess data reliability. Aggregate measures of evaluations of hypothetical new products were found to be reliable. However, certain respondents were unreliable, and decision criteria were developed for identifying such respondents when resurvey data are unavailable. The effects of eliminating potentially unreliable respondents from the survey sample on the results of data analyses were investigated. The magnitude of the effects depended upon the type of analysis conducted.
Citation: Abraham D. Horowitz and Thomas F. Golob (1979), "Survey Data Reliability Effects on Results of Consumer Preference Analyses," in NA - Advances in Consumer Research Volume 06, eds. William L. Wilkie, Ann Arbor, MI: Association for Consumer Research, Pages: 532-538.



Reliability is an essential characteristic of a measuring instrument. Synonyms for reliability include dependability, stability, consistency, and predictability. A measure is said to be reliable if it produces the same results on different occasions when conditions are kept constant. Reliability is a necessary but not sufficient condition for validity.

The concept of reliability in social sciences was developed in the context of psychological and educational measurement and was focused on how consistently skills of individuals are rated on two successive occasions (test-retest procedure). If the time interval between test and retest is short enough, changes can be attributed to the unreliability of the measuring instrument and/or the individual rather than to a true change in the individual. In the clinical and social psychological literature there is abundant evidence that attitude scales, like individuals' skills scales, are reliable, yielding comparable results when administered on different occasions (Shaw and Wright, 1967).

In the consumer behavior literature, on the other hand, the issue of reliability is seldom discussed, and data regarding the reliability of the instruments used in consumer research are rarely provided. As an illustration, Jacoby (1976) reviewed 300 brand loyalty studies and found only three that measured test-retest reliability of the data. As a second illustration, at the previous two annual meetings of the Association for Consumer Research (1976 and 1977), apparently only two papers dealt with data reliability (Peter, 1977 and Best et al., 1977). The first study discussed the methodological problems encountered in the measurement of data reliability, and the second reported high reliability at the aggregate level but low to moderate reliability at the individual level. The present research is aimed at explaining such distinctions between aggregate- and individual-level results.

The purpose of the present research is twofold. The first objective is to test the reliability of consumers' evaluations of hypothetical new personal transportation vehicles elicited through a mail panel survey. This is accomplished through application of a test-retest procedure. The second objective is to test whether elimination of potentially unreliable survey respondents significantly affects results concerning consumers' preferences among the hypothetical concepts and attribute importances inferred from choice models. This is accomplished through comparison of results from analyses of the entire mail-panel survey sample with those from analyses of subsamples. These subsamples are determined by application of criteria developed in the test-retest analyses to identify potentially unreliable respondents.


A nationwide mail panel survey of 1565 consumers was administered to elicit consumer responses to four hypothetical new transportation concepts. For comparison, evaluations of the present vehicle driven or ridden in most often by respondents were also collected. Concept presentations consisted of words and sketches. In recognition of the complexity of the survey, each respondent was presented with only three of the four concepts, chosen and ordered randomly.
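The randomized presentation can be sketched as follows (a minimal illustration in Python; the concept labels are hypothetical placeholders, not the names used in the survey):

```python
import random

# Hypothetical placeholder labels for the four transportation concepts.
CONCEPTS = ["concept_1", "concept_2", "concept_3", "concept_4"]

def assign_concepts(rng):
    """Choose three of the four concepts for one respondent, in random
    order, mirroring the survey's randomized presentation."""
    return rng.sample(CONCEPTS, k=3)  # sample without replacement

chosen = assign_concepts(random.Random(7))
```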

In order to test the reliability of these data, fifty-six respondents were randomly chosen from the initial survey and were asked to respond again to the same questionnaire four months after the full scale survey. Survey responses of the fifty-six respondents on the initial surveys are referred to as "test" data, while responses to the same questions collected in the follow-up survey are referred to as "retest" data.

Respondents were asked first to rate how much they liked or disliked each hypothetical concept (hereafter, the 1st affect scale). Affect ratings were also obtained for the presently owned vehicle which each respondent drove or rode in most often. Then, respondents rated their satisfaction with their present vehicle and with each concept on each of seventeen attributes. Following the satisfaction ratings, respondents were asked to rate how much they agreed or disagreed with nine statements concerning the concepts (hereafter, the agree-disagree scales). Finally, respondents were asked again to rate their liking or disliking of each concept (the 2nd affect scale) and then to state their intention to purchase each concept "if this vehicle were available today" (the purchase intention scale).

All the scales have seven categories (-3, -2, -1, 0, 1, 2, 3), with the exception of purchase intention, which has five categories. For purposes of interpreting results in the present research, liking, positive intention, satisfaction, and agreement are described by positive numbers; dislike, negative intention, dissatisfaction, and disagreement are described by negative numbers. Neutral attitudes are represented by zero.


Aggregate Measures

The differences between test and retest means are small for all scales, their absolute values being typically less than or equal to 0.5, or half of the interval between two adjacent categories. Average test-retest ratings for each scale are presented in Appendices A and B. None of the differences were significant at the α = 0.01 level for univariate t-tests, suggesting that no systematic shifts in aggregate attitudes occurred. Moreover, correlation coefficients between test and retest computed over all average ratings were very high, ranging from 0.94 to 0.97 for the concepts (n = 29 scales for each concept). The order of the average affect and purchase intention ratings of the concepts and the present vehicle is the same in the test and retest. Therefore, at the aggregate level -- when individual test-retest changes are ignored -- the survey results were very reliable.
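These aggregate comparisons can be reproduced in outline as follows (a sketch with made-up average ratings; the paper's actual data appear in Appendices A and B and are not reproduced here):

```python
import math
import statistics

def paired_t(test_means, retest_means):
    """Paired t statistic for the differences between test and retest scale means."""
    d = [x - y for x, y in zip(test_means, retest_means)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def pearson_r(xs, ys):
    """Pearson correlation between test and retest average ratings."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative average ratings for a handful of scales (hypothetical numbers).
test = [1.2, -0.4, 0.8, 2.1, -1.0, 0.3]
retest = [1.0, -0.2, 0.9, 1.8, -1.1, 0.5]
```

An insignificant paired t alongside a high correlation across scale means is the pattern the paper reports at the aggregate level.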

Disaggregate Measures

A different picture emerges when the test ratings for each scale are related to the retest ratings, taking the individual respondent as the unit of observation. Correlations between test and retest for each of the twenty-nine measures (affect, purchase intention, satisfaction, and agree-disagree scales) were calculated for each concept separately. The number of scales with correlations significantly different from zero at the 0.01 level ranged from four (14% of the scales) to twelve (41%) for the concepts and was 13 (72%) for the present vehicle. The significant test-retest correlations were in the moderate range: 0.33 - 0.70. These results imply that a simple linear model is not able to predict the retest data from the initial test data on an individual-by-individual basis for most of the scales.

The insignificant correlations may be due in part to the concentration of responses within a limited range of the scales or to attenuation caused by applying interval assumptions to ordinal-quality data (Bohrnstedt, 1970).

To avoid possible shortcomings of correlation analysis, an alternative approach based upon the average absolute deviation between test and retest was developed. Defining dxy = (1/n) Σi |xi - yi|, where n is the sample size and xi and yi denote respondent i's scale responses in the test and retest, respectively, an iterative algorithm was devised to compute probabilities that dxy will be less than certain values assuming random distributions of responses. The absolute deviation measures computed for each scale and for each concept were then compared to the 0.01-level critical values found for dxy. The responses on most of the scales (21, 23, 24 and 28 out of 29 scales) for the four concepts and on all scales for the present vehicle were found to be significantly different from random data at the 0.01 level.
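The deviation statistic, and a critical value for it under the hypothesis of random responding, can be sketched as follows. The paper's iterative algorithm is not spelled out, so this sketch substitutes a simple Monte Carlo approximation:

```python
import random

def d_xy(xs, ys):
    """Average absolute deviation between test (xs) and retest (ys) responses."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def critical_value(n, categories, alpha=0.01, draws=10000, seed=1):
    """Monte Carlo approximation of the value below which d_xy falls with
    probability alpha when test and retest responses are both drawn
    uniformly at random from the scale categories."""
    rng = random.Random(seed)
    stats = []
    for _ in range(draws):
        xs = [rng.choice(categories) for _ in range(n)]
        ys = [rng.choice(categories) for _ in range(n)]
        stats.append(d_xy(xs, ys))
    stats.sort()
    return stats[int(alpha * draws)]

SCALE = [-3, -2, -1, 0, 1, 2, 3]  # the seven-category response scale
crit = critical_value(n=56, categories=SCALE)
# An observed d_xy below `crit` rejects the hypothesis of random responding
# at roughly the alpha level.
```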

Thus, while test-retest correlations computed on an individual-by-individual basis are not significantly different from zero for most of the scales and are low to moderate for the other scales, hypotheses of random test-retest relationships are consistently rejected.

Characteristics of Unreliable Respondents

It was hypothesized that the low to moderate correlations were due to the inclusion in the sample of certain unreliable respondents. In order to identify such respondents, a correlation coefficient was computed for each of the 56 individuals in the sample. For a respondent with complete data, the sample size for this computation was 87 scales (29 scales for each of three concepts). Figure 1 displays the distribution of individual correlations. Since the distribution is bimodal, two types of respondents emerged: "reliable respondents," characterized by moderate and high correlations ranging from 0.45 to 0.79, and "unreliable respondents," characterized by low correlations from -0.16 to 0.38. Remarkably, all 39 correlations greater than 0.45 were significantly different from zero at the α = 0.001 level, whereas the other 17 correlations were not.
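The per-respondent screening behind Figure 1 can be sketched as follows (pooling each individual's test and retest answers across all scales; the 0.45 split point reflects the bimodal distribution reported above):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between a respondent's test and retest answers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def label_respondent(test, retest, split=0.45):
    """Classify one respondent from his or her pooled test-retest scale pairs."""
    return "reliable" if pearson_r(test, retest) >= split else "unreliable"
```

For example, `label_respondent([1, 2, 3, 0, -1, 2], [1, 3, 2, 0, -1, 1])` returns "reliable", since those two short answer vectors correlate well above the split point.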



Two criteria were found which effectively distinguished between reliable and unreliable respondents using data collected at only one point in time:

1. The proportion of the respondent's total scale answers which were in the most frequent scale category for that respondent (i.e., the respondent's propensity to provide the same answer for every question).

2. The proportion of unanswered scales (missing data).

Figures 2 and 3 display the distributions of these two criteria. The first figure shows that among the unreliable respondents there was a considerably greater tendency to provide the same answer for every scale, that is, to mark the same scale category. For example, while 25% of the unreliable respondents marked the same scale category in more than 50% of their responses, only 3% of the reliable respondents did so.



Figure 3 shows that while almost 70% of the reliable respondents answered all the scales, 65% of the unreliable respondents skipped some of the scales.

The two criteria can be used to identify potentially unreliable respondents in a particular data set by establishing a cutoff point for each criterion. Table 1 shows that if these cutoff points are set for the test data such that only respondents whose proportion of responses in their most frequent category is less than 40% and whose percentage of missing data is less than 6% are retained, then the rate of correct classification is 79% for reliable respondents and 76% for unreliable respondents. Optimum cutoff points must be chosen in light of the availability of data and the cost involved in discarding data.
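A minimal screening routine based on the two criteria, using the cutoffs of Table 1 (40% for the modal response category, 6% for missing data), might look like this:

```python
from collections import Counter

def screen(responses, modal_cut=0.40, missing_cut=0.06):
    """Flag a respondent as potentially unreliable when either criterion
    reaches its cutoff.  `responses` holds one entry per scale, with None
    marking an unanswered scale."""
    n = len(responses)
    missing = sum(1 for r in responses if r is None) / n
    answered = [r for r in responses if r is not None]
    modal = (Counter(answered).most_common(1)[0][1] / len(answered)
             if answered else 1.0)
    return modal >= modal_cut or missing >= missing_cut

# A respondent who marks the same category on most scales is flagged;
# a respondent with varied, complete answers is retained.
flat = [2] * 20 + [1, 0, -1]
varied = [-3, -1, 0, 2, 1, 3, -2, 0, 1, 2, -1, 0,
          3, 1, -2, 2, 0, 1, -1, 2, 3, 0, 1]
```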

As expected, the elimination of unreliable respondents leads to an increase in the number of scales with correlations significantly different from zero: for the total sample, 30% of all scales had significant correlations, and this increased to 48% for the reliable subsample. Moreover, test-retest absolute deviations were computed for all scale and concept combinations for the reliable subsample. The percentage of deviations smaller than the critical value for the 0.01 level increased from 85% to 92%.


Subsample Determination

For each respondent to the nationwide mail-panel survey (n = 1565) two measures were computed based upon the test-retest results: (1) proportion of unanswered scales (i.e., missing data) and (2) proportion of responses in the most frequent scale category.





Figures 4 and 5 present the frequency distributions of these two measures. Figure 4 shows that a substantial portion of the sample had a rate of missing data of more than 1%. Figure 5 shows that for many respondents (16%) the proportion of responses in their most frequent category was between 40% and 50%. Based on the shape of these two distributions and the decision rules developed for the test-retest sample (Table 1), two levels of acceptance for missing data and two levels for the most frequent response category were chosen. These levels are shown in Table 2. For a given cell in the matrix of Table 2, a respondent who did not pass both acceptance levels was identified as unreliable and was eliminated from consideration. The sizes of the remaining subsamples ranged from 41.6% to 76.5% of the total sample. This range includes the percentage of test-retest respondents found to be reliable in the test-retest analysis (70%).

Sociodemographic Characteristics of Potentially Unreliable Respondents

Results summarized in Table 3 show that respondents who satisfy the two reliability criteria are younger and better educated than those who do not satisfy the criteria. These differences are statistically significant at the α = 0.01 level. No statistically significant differences between the subsamples and the total sample were found with respect to income, sex, household size, auto ownership, or housing type.



Attitudes Toward Radically New Concepts

It is assumed that unreliable respondents, when presented with hypothetical new products or services, respond in random fashion or tend to provide the same answer for all questions. Consequently, the average of their responses for a given scale should approximate zero, the scale's midpoint. This leads to the first hypothesis regarding the effect of eliminating potentially unreliable survey respondents.







Hypothesis I: The elimination of potentially unreliable respondents to a consumer survey results in a systematic shift in ratings toward hypothetical new products or services on each attitudinal scale:

(1) An increase in the absolute value of the average attitudinal measure (i.e., a shift in the average rating outward toward the nearest end-point of the scale); and

(2) A decrease in data dispersion.

Three of the four hypothetical transportation concepts were radically different from anything commonly used today. Average ratings and standard deviations were calculated for all 29 scales for each of these three concepts. Differences between each subsample and the total sample on the absolute values of the average ratings and the standard deviations were then computed. These data show that the first part of Hypothesis I is confirmed for 22 to 25 scales for one typical concept, depending on subsample. The systematic shift toward the extreme ends of the scales cannot be explained on the basis of pure chance, because the probability of obtaining 21 or more changes in the predicted direction, of the 29 scales, is less than 0.01. Results for the other two concepts are similar, with confirmations for 27 to 29 scales and 21 to 23 scales respectively. The second part of the hypothesis regarding a decrease in data dispersion was confirmed for all scales in all concepts without exception.
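The chance probability cited here can be checked with an exact sign-test (binomial) calculation; under a fair-coin null, the tail probability for 21 or more of 29 shifts in the predicted direction is on the order of one percent:

```python
from math import comb

def sign_test_tail(k, n):
    """P(at least k of n independent shifts fall in the predicted direction
    when each direction is equally likely), i.e. an exact binomial tail."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

p21 = sign_test_tail(21, 29)  # tail probability for 21 or more of 29
p22 = sign_test_tail(22, 29)  # tail probability for 22 or more of 29
```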

Differences among the four subsamples were ordered in the expected direction for each concept. That is, the more severe the criteria of reliability, the larger the differences in both mean and standard deviation between the subsample and the total sample.

Attitudes Toward Existing Concepts

Differences in scale means and standard deviations between the total sample and each of the subsamples in respondent's evaluations of their existing vehicles were also calculated. Here Hypothesis I was consistently rejected. All mean ratings for the subsamples are lower in absolute value than those for the total sample, and most standard deviations are higher. Potentially unreliable respondents were more extreme and positive toward their existing vehicles than were the remaining respondents.

Potentially unreliable respondents were older and less educated than the average respondent, and they were also more satisfied with their present vehicles. This result is consistent with findings (Campbell et al., 1976) that the higher the educational level of an individual, the less happy he or she is with the quality of life, due to higher expectations. A second hypothesis thus emerges.

Hypothesis II: The elimination of potentially unreliable respondents to a consumer survey results in a systematic shift in ratings of well known and strongly liked products or services on each attitudinal scale:

A decrease in the absolute value of the average attitudinal measure (i.e., a shift in the average rating away from the positive end point of the scale).

Attitudes Toward Evolutionary New Concepts

One hypothetical transportation concept was considerably more similar to existing automobiles than was any other concept. With regard to this concept, both Hypothesis I and Hypothesis II hold to some degree. Since these two hypotheses lead to opposite shifts in attitudinal measures, a substantially lesser degree of systematic shift is detectable for this concept. Consequently, a third hypothesis is proposed to cover this composite situation in which neither a totally new product nor an existing one is the subject of consumer evaluations.

Hypothesis III: The elimination of potentially unreliable respondents results in no systematic shift in attitudinal ratings of new products or services which are evolutions or variations of well known existing products or services, due to the joint effects of Hypothesis I and II.

Aggregate Preferences for the Concepts

Consumers' aggregate preferences among the concepts were computed from responses to a survey question eliciting a ranking of the concepts in terms of ownership preferences. These rankings are closely related to the ordering of the concepts on the second affect rating: in 86% of all cases, if one concept was ranked higher than another, it was also rated at least as high on the second affect question. Consequently, Hypotheses I, II, and III, which cover shifts in mean affect ratings, can be cross-validated using aggregate results from the ranking question.

Preference Choice Models

Consumers' preferences among the concepts were modeled in terms of consumer satisfactions with various attributes of the concepts. In this way relative importances of the features were estimated. The specific choice model used was the multinomial logit model (Punj and Staelin, 1976). This model belongs to the class of models referred to as strict utility models by economists and Bradley-Terry-Luce models by psychologists. (For an overview of these choice models see Luce, 1977.) It is argued herein that the following two characteristics of the multinomial logit model, together with the small magnitude of the shifts described in the previous sections, cause the preference model results to remain approximately unaltered when potentially unreliable respondents are eliminated. It was found empirically that this was indeed the case.

First, the logit model, like most other models in the class of strict utility models, is specified in terms of utility differences between pairs of choice alternatives. Thus, the specification is unique only up to differences in satisfaction ratings between two concepts on any attribute. Since the shifts in attribute ratings resulting from elimination of potentially unreliable respondents are in the same direction for all concepts, the small magnitudes of these shifts are further cancelled out in computing differences. This is demonstrated in Table 4, which shows that only 7 of the 48 differences across all pairs of concepts for two attributes differ significantly between a subsample and the total sample at the α = 0.01 level. These two attributes were found to best represent two orthogonal factors determined in principal component analyses of the seventeen satisfaction scales. Results similar to those in Table 4 were found for all remaining attributes as well.
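The differencing property can be illustrated numerically: multinomial logit choice probabilities depend only on utility differences, so a common shift applied to every alternative leaves the probabilities unchanged (a toy check, not the paper's estimation):

```python
import math

def logit_probs(utilities):
    """Multinomial logit choice probabilities over a set of alternatives."""
    exps = [math.exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

v = [0.7, -0.2, 1.1]            # illustrative utilities for three concepts
shifted = [u + 0.5 for u in v]  # a common shift across all alternatives

p1 = logit_probs(v)
p2 = logit_probs(shifted)
# p1 and p2 coincide: only the differences between utilities matter.
```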

Second, the technique used to estimate coefficients of the utility function specified in the multinomial logit model is that of maximum likelihood. Maximum likelihood estimators for the logit model were shown by McFadden (1973) to be asymptotically unbiased, efficient and consistent. Such desirable statistical properties ensure that the estimators are well behaved in the presence of additive disturbances which are independently distributed across the population. A necessary condition for estimator invariance in light of the introduction of disturbances attributable to unreliable respondents is that such disturbances exhibit the same central tendency as disturbances attributable to other sources, such as noise in measurement and misspecification of utility.

Table 5 shows that the models yield approximately the same results. Thus, inclusion of potentially unreliable survey respondents in this application apparently does not bias conclusions regarding the relative importances of various design features of the transportation concepts. Similar stability would be expected for any of the so-called multi-attribute attitude models frequently used in consumer research. These models (reviewed by Wilkie and Pessemier, 1973) typically hypothesize that a consumer compares products by combining values or utilities on individual features in a manner similar to that underlying the logit model.

Thus, while the shifts in attitudes resulting from removal of potentially unreliable respondents are systematic, since these shifts are small, and since only the relative values of attitudes on different features and different products are used, there are no major changes in preference model results.










REFERENCES

R. J. Best, D. I. Hawkins, and G. Albaum, "Reliability of Measured Beliefs in Consumer Research," in W. D. Perreault (Ed.), Advances in Consumer Research, 4(1977), 19-23.

G. W. Bohrnstedt, "Reliability and Validity Assessment in Attitude Measurement," in G. F. Summers (Ed.), Attitude Measurement (Chicago: Rand McNally, 1970), 80-99.

A. Campbell, P. E. Converse, and W. L. Rodgers, The Quality of American Life (New York: Russell Sage Foundation, 1976).

J. Jacoby, "Consumer Research: Telling It Like It Is," in B. B. Anderson (Ed.), Advances in Consumer Research, 3(1976), 1-11.

R. D. Luce, "The Choice Axiom after Twenty Years," Journal of Mathematical Psychology, 15(1977), 215-233.

D. McFadden, "Conditional Logit Analysis of Qualitative Choice Behavior," in P. Zarembka (Ed.), Frontiers in Econometrics (New York: Academic Press, 1973).

J. P. Peter, "Reliability, Generalizability and Consumer Behavior," in W. D. Perreault (Ed.), Advances in Consumer Research, 4(1977), 394-400.

G. N. Punj and R. Staelin, "A Model of the College Choice Process," in K. L. Bernhardt (Ed.), Marketing 1776-1976 and Beyond, 1976 Educators' Proceedings, American Marketing Association (1976), 227-241.

M. E. Shaw and J. M. Wright, Scales for the Measurement of Attitudes (New York: McGraw-Hill, 1967).

W. L. Wilkie and E. A. Pessemier, "Issues in Marketing's Use of Multi-Attribute Attitude Models," Journal of Marketing Research, 10(1973), 428-441.