Alternative Approaches to Assessing the Quality of Self Report Data

Robert A. Hansen, The Ohio State University
Carol A. Scott, The Ohio State University
ABSTRACT - Consumer behavior researchers place a great deal of emphasis on self-report or questionnaire data as input to decision making, but give little attention to assessing the quality of subjects' responses. A review of classical reliability theory indicates that available reliability measures are inappropriate in many consumer behavior data collection situations. This is particularly true in those cases where non-scaled or ad hoc measures are collected. Data quality requirements are discussed and some alternative methods of assessing data quality are presented.
To cite: Robert A. Hansen and Carol A. Scott (1978), "Alternative Approaches to Assessing the Quality of Self Report Data," in Advances in Consumer Research, Volume 5, eds. Kent Hunt, Ann Arbor, MI: Association for Consumer Research, Pages: 99-102.

Advances in Consumer Research Volume 5, 1978      Pages 99-102

INTRODUCTION

Consumer behavior researchers depend heavily on self-report data generated through mail survey questionnaires or interview schedules for information about consumer attitudes, preferences, and behavior. Self-report data is an integral part of laboratory and field experiments as well as survey research, and the recent explosion of articles on generating responses to self-report instruments attests to its importance (see Houston and Ford, 1976; Kanuk and Berenson, 1975; Pressley, 1976 for reviews). Much of this work, however, has been limited to a focus on the quantity of data collected, primarily through mail questionnaires. Little attention has been devoted to the parallel issue of the quality of self-report data regardless of the collection method used.

This paper attempts to examine the topic of self-report data quality in some detail. Specifically, the objectives of this paper are to critically evaluate current methods of assessing data quality in those situations where traditional methods cannot be used. To provide an adequate context for this discussion, basic data quality requirements are first reviewed.

DATA QUALITY REQUIREMENTS

At least two broad criteria for data quality can be identified. First, responses must be meaningful. That is, they should reflect something other than a random checking of alternatives. This implies that responses reflect a true assessment of a particular behavior (e.g., frequency of purchase) or a true assessment of some personal characteristic (e.g., attitudes, brand loyalty, dogmatism), as opposed to unsystematic, situation-specific variation. The issue here is typically thought of as the degree of reliability of the responses.

Second, responses should be good indicators of the phenomenon of interest. That is, given that an adequate degree of reliability has been achieved, one must demonstrate that the scores are an adequate representation of the construct or quality under investigation. The issue in this case is one of validity (predictive, content, construct, etc.), which is examined by investigating the sources of the systematic variation observed.

This paper will focus primarily on the first of these criteria since reliability is necessary (but not sufficient) for validity. A second reason for this emphasis is that there are unique problems associated with assessing the meaningfulness of responses to non-scaled measures, which are frequently used by consumer behavior researchers and other applied scientists, and these problems have received relatively little attention in the consumer behavior literature. Finally, the soundness of decisions (whether of managerial or public policy significance) based upon research data is profoundly affected by the degree of reliability associated with the data. As one prominent methodologist notes (Stanley, 1971, p. 58):

Unreliability places a question mark after the score (or response) and causes any judgment based on it to be tentative to some extent. The lower the reliability of the score, the more tentative the judgment or decision must be, until, in the extreme case, as reliability approaches zero, the score (or response) provides no basis at all for any judgment or decision...

TRADITIONAL METHODS OF ASSESSING RELIABILITY

Appropriate methods of assessing the reliability of data depend upon the type of measurement instrument used. Consumer behavior researchers generally use two types of instruments: scaled measures and non-scaled, or ad hoc, measures. A scaled measure is defined here to be the result of pooling or combining a number of items selected from a larger pool to yield a score on the particular dimension of interest (e.g., androgyny, authoritarianism, attitudes toward government, etc.). Non-scaled or ad hoc measures are those which are not designed to be combined for a summary score, but rather are of interest in and of themselves (e.g., frequency of purchase, preference for a given brand, etc.). Although specific methods for assessing reliability will differ depending upon the type of measurement instrument used, the logic should be similar. That is, one should gain some insight into the degree of systematic variation captured by the instrument. The discussion here is an evaluation of the appropriateness of currently available techniques for scaled and non-scaled data.

Scaled Measures

When the measurement instrument consists of a series of items which form a scale, assessing reliability can be accomplished by using one of several traditional methods. The mechanics of these techniques are outlined, and the relative merits of each are discussed, in several marketing research textbooks (e.g., Churchill, 1976; Tull and Albaum, 1973; Tull and Hawkins, 1976) and in several recently published articles (e.g., Lundstrom and Lamont, 1976; Pressley, 1976).

If two measurements are possible, a test-retest correlation can be calculated. This is perhaps not the best alternative, however. It is impractical in many consumer behavior research contexts to contact respondents twice for the necessary repetition of the measure. And, many constructs of interest to consumer researchers are variable over time. Since attitudes toward brands can change substantially even in the short run, for example, a low test-retest correlation would be difficult to interpret. It might be possible to administer parallel forms of the same instrument, but this may be possible only in those situations in which a researcher or his representative is present to provide an explanation for any apparent repetition, and where length of time necessary to complete the parallel forms is not prohibitive.
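As an illustration of the test-retest calculation described above, the following minimal Python sketch computes a Pearson correlation between two administrations of the same measure. The function name `pearson_r` and the five respondents' scores are hypothetical and are included only to show the arithmetic; they are not data from any study discussed here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical brand-attitude scores for the same five respondents,
# measured at a first and a second administration.
time_1 = [4, 2, 5, 3, 1]
time_2 = [5, 2, 4, 3, 1]

test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 2))  # 0.9
```

A low coefficient from such a calculation would remain ambiguous for the reason given above: it cannot by itself distinguish unreliable responses from genuine change in the construct between the two administrations.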

Equivalence or internal consistency measures are also appropriate for a set of items which form a scale for which a summary score will be derived (Cronbach, 1951; Kuder and Richardson, 1937). This type of coefficient, unlike test-retest, is easily obtained, and should be calculated routinely for scaled instruments (Nunnally, 1967). A high reliability coefficient will provide some evidence that the responses reflect a systematic tendency, and will provide some rationale for combining the responses to form a scale score. Unlike test-retest procedures, equivalence coefficients should not pose any practical difficulties in terms of supervisory or time needs.
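To make the internal consistency calculation concrete, the sketch below computes coefficient alpha (Cronbach, 1951) in its usual form, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the summed score), where k is the number of items. The function name `cronbach_alpha` and the small response matrix are hypothetical and serve only as an illustration.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Transpose so that each tuple holds one item's scores across respondents.
    items = list(zip(*item_scores))
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical responses: four respondents answering a three-item scale.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 5, 4],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # 0.95
```

The calculation presupposes that the items are meant to be summed into a single score, which is precisely the condition that ad hoc measures fail to meet.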

In summary, researchers should follow the standard procedures outlined in measurement texts in those instances where scaled measures are used. In some situations, a scaled measure does not exist, but could be constructed. That is, the concept is potentially scalable, but consumer researchers have not yet attempted to do the scale development work. It should be clear, however, that scales should be constructed whenever possible (Hughes, 1971; Lundstrom and Lamont, 1976).

Ad Hoc Measures

In some situations, the researcher cannot or perhaps would not want to construct a scaled measure, and a non-scaled or ad hoc measure is used instead. These non-scaled measures present some difficulties for quality assessment. The following section discusses ad hoc independent and dependent measures in greater detail.

Ad Hoc Independent Measures. In assessing the reliability of ad hoc independent or predictor measures, the use for which the data is intended must be considered since some procedures will be feasible for theory-testing research and others will be appropriate for applied or problem-solving research. An example using the multi-attribute product evaluation framework will illustrate this difference. If a researcher is interested in predicting product preference or purchase behavior, a strong case can be made for using the domain sampling model (Nunnally, 1967) and selecting multiple items which represent all aspects of the product evaluation process. Traditional methods such as Cronbach's coefficient alpha could then be used to assess reliability via internal consistency. An applied researcher, however, is frequently interested in evaluating a select group of the product's attributes. Further, he is most likely interested in the individual attributes themselves rather than in an overall summary score. Test-retest procedures are generally not acceptable for practical reasons, and internal consistency is not appropriate since a summary score is not desired. As yet, there are no widely accepted methods for assessing the reliability of this type of data.

Ad Hoc Dependent Measures. Criterion measures such as purchase behavior or product usage must also be examined from a reliability standpoint since reliable predictor variables applied to an unreliable criterion may produce misleading results. Many criterion variables are analogous to the physical behavior and motor skills measures discussed by Stanley (1971). That is, an item pool does not exist because there are only a small number of ways in which certain questions can be asked. Since these variables are not scalable, traditional measures of reliability cannot be used as indicators of data quality, and therefore, other methods must be developed.

ALTERNATIVE APPROACHES TO EXAMINING DATA QUALITY

There are several techniques which the consumer behavior researcher can use to infer data quality when traditional tools are not appropriate. These methods include analysis of the completeness of response, depth of response, and internal consistency of response patterns. These methods are admittedly non-traditional, with only internal consistency of response patterns bearing a slight resemblance in title and operation to a psychometric interpretation of reliability. It must be remembered, however, that the data collection needs of the consumer behavior researcher are also quite often non-traditional in nature, requiring the collection of ad hoc or non-scaled data. The results of these analyses can be used in two ways. First, the researcher who has collected data from a number of people can identify those individuals whose responses might be of questionable quality. These questionable respondents might then be discarded from the data base. Second, the results may be used to assess the overall effectiveness of various methods used to collect data. Viewed in this manner, the measures could be used, for example, to evaluate the effects of alternative methods of stimulating mail survey response rates (Houston and Ford, 1976).

Completeness of Response

Completeness deals with the issue of item non-response. Of course, the researcher's goal is to generate 100 percent response to all questions. Any deviation from total response to all questions raises a number of difficult problems. In many instances, methods used to handle missing values (e.g., drop those cases from the analysis, substitute a mean value, etc.) can have a serious effect on the representativeness of the results (Hansen and Scott, 1977).

Item non-response can be expressed in a variety of ways. These include the mean number of missing values, the percentage of missing values, and complete versus incomplete response (Houston and Ford, 1976). However non-response is defined, any modification of the data collection process which yields a lower item non-response is desirable. One modification of a mail survey could be the inclusion of an incentive to respond or the promise of anonymity. A second modification of the data collection process could involve allowing respondents to ask questions or providing a "don't know" option for questions.
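The three ways of expressing item non-response listed above can be sketched in a few lines of Python. The representation of a questionnaire as a list of answers with `None` marking an unanswered item, the function name `nonresponse_summary`, and the example data are all assumptions made for illustration.

```python
def nonresponse_summary(questionnaires):
    """Summarize item non-response; None marks an unanswered item."""
    if not questionnaires:
        raise ValueError("no questionnaires supplied")
    # Number of unanswered items on each returned questionnaire.
    missing_counts = [sum(answer is None for answer in q) for q in questionnaires]
    total_items = sum(len(q) for q in questionnaires)
    n = len(questionnaires)
    return {
        "mean_missing_per_questionnaire": sum(missing_counts) / n,
        "percent_missing": 100 * sum(missing_counts) / total_items,
        "percent_complete": 100 * sum(count == 0 for count in missing_counts) / n,
    }


# Hypothetical returns: four questionnaires with five items each.
example_questionnaires = [
    [3, 1, 4, 2, 5],
    [2, None, 3, 3, 1],
    [5, 5, 4, 4, 2],
    [None, None, 1, None, 3],
]

summary = nonresponse_summary(example_questionnaires)
print(summary)
```

Computing the same summary separately for each data collection treatment (for example, incentive versus no incentive) would allow the treatments to be compared on completeness as well as on response rate.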

Depth of Response

Quality can also be viewed in terms of response to open-ended questions. The additional insight provided by an analysis of the depth of response is illustrated by the following example from a project currently being completed by one of the authors. In a mail survey, potential respondents were randomly assigned to one of three incentive groups: a monetary incentive group (i.e., 25¢ included with the survey), a non-monetary incentive group (i.e., a ball point pen of comparable value included with the survey), and a no-incentive group. The response rates for the three treatments (which indicate the quantity of data) were 38%, 22% and 14%, respectively. These results seem to indicate rather dramatically the effect of including inducements in the survey. However, when the returned questionnaires were analyzed for percent of missing values, the rates were significantly higher for both incentive groups. In addition, an independent panel of three judges rated the responses to a series of open-ended questions on the questionnaire and these ratings were subsequently analyzed. Results indicated that a significantly greater proportion of the responses in the no-incentive group were rated superior by the judges. In general, the overall quality of response from the no-incentive control group was significantly better than that for the two incentive groups. Clearly, in this case at least, the issue of the quality of data obtained causes us to re-evaluate the supposedly superior performance of inducements to respond to a mail survey.

There is another important aspect of assessing data quality which relates to those questionnaires where the respondent has in fact answered all or nearly all questions and where content analysis of responses is not possible. One would, of course, suspect the quality of responses from a person whose questionnaire contained a significant number of unanswered questions. The equally important but more complex issue of evaluating the quality of a completed questionnaire is discussed below.

Internal Consistency

Internal consistency as used here is perhaps most closely associated with the concept of response bias. Response bias is frequently mentioned in connection with the self-selection problem in mail surveys. In a number of these studies, mathematical estimates of bias have been generated. Kanuk and Berenson (1975) discuss this interpretation of bias and the controversy surrounding its use. In addition, response bias has been investigated by validating responses given in self-report sessions against other records of the same information (Kerin, 1974; Kerin and Peterson, 1977). In still other cases, bias has been defined as a difference in response pattern derived from a similar population using, for example, different incentives to respond (Whitmore, 1976) or other attempts to stimulate response (Field, 1975; Wiseman, 1972). Bias, it appears, has thus been used as a general umbrella to encompass a number of processes, from true validity checks to simple checks of difference. In the former, it is assumed that one response is true (and deviations are errors), and in the latter it is simply assumed that differences are bad. Bias, when operationalized using a validating test that is known or accepted as true, captures the essence of the data quality issue. Defined in this manner, bias focuses on the question of whether the responses are indicative of a true feeling as opposed to a random or systematically altered response.

There are, however, two problems associated with using validated response bias as an overall indicator of data quality. First, there is the practical problem of collecting the validation information. This kind of information is only rarely available (Kerin, 1974; Kerin and Peterson, 1977) and if it has to be collected in addition to the data of interest it will increase project costs considerably.

A second drawback associated with using bias as a quality indicator is offered by Ferber (1948, p. 670): "The problem of response bias must be considered with specific reference to a particular question or characteristic. The presence of bias in one question does not mean a priori that the replies to other questions on the same questionnaire are also biased."

This suggests that question-dependent response bias is not an acceptable indicator of overall response quality. What is needed is an assessment of quality derived from the entire self-report data collection process. One possibility for assessing data quality which seems consistent with this goal could include bogus questions (Friedman and Goldstein, 1975), but could be expanded to include simple response inconsistencies. Additional questions may have to be added to the data collection instrument, but in many cases this will not be necessary. Careful attention to question wording can be used to generate logic checks. In this way the overall quality of respondent answers is inferred from the overall logic or consistency in response patterns. For example, if a respondent indicates that he/she has never heard of a particular product and then proceeds to evaluate that product, then one should be suspicious. Or, sequential sets of questions may be developed to allow for examination of the pattern of responses. If a respondent indicates that he/she never engages in a behavior and then in a later section answers a frequency question with a response of greater than zero, one must be suspicious of the quality of the data. Clearly, there are a number of safeguards which must be built into the analysis of such logic checks. First and foremost, all procedures must be pre-tested for clarity of instructions and ease of understanding. Second, one must examine the incidence of logic errors. One such error should not be taken as evidence of poor quality of response. Several errors of this type, however, would indicate problems with the quality of responses.
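The two logic checks described above can be sketched as follows. The question names (`heard_of_product`, `product_rating`, `ever_uses_coupons`, `coupon_uses_per_month`), the respondent records, and the threshold of two errors are all hypothetical; in particular, the threshold is only one possible reading of "several errors" and would need to be set by the researcher after pre-testing.

```python
def count_logic_errors(response):
    """Count contradictions in one respondent's answers (None = unanswered)."""
    errors = 0
    # Check 1: the respondent evaluates a product he/she claims never to have heard of.
    if response.get("heard_of_product") is False and response.get("product_rating") is not None:
        errors += 1
    # Check 2: the respondent never engages in a behavior, yet reports a frequency above zero.
    frequency = response.get("coupon_uses_per_month")
    if response.get("ever_uses_coupons") is False and frequency is not None and frequency > 0:
        errors += 1
    return errors


def flag_questionable(responses, threshold=2):
    """Return IDs of respondents with at least `threshold` logic errors.

    The default threshold of 2 is an assumption: a single error is not
    treated as evidence of poor quality.
    """
    return [rid for rid, answers in responses.items()
            if count_logic_errors(answers) >= threshold]


# Hypothetical respondent records keyed by an arbitrary ID.
example_responses = {
    "r1": {"heard_of_product": True, "product_rating": 4,
           "ever_uses_coupons": True, "coupon_uses_per_month": 3},
    "r2": {"heard_of_product": False, "product_rating": 5,
           "ever_uses_coupons": False, "coupon_uses_per_month": 0},
    "r3": {"heard_of_product": False, "product_rating": 2,
           "ever_uses_coupons": False, "coupon_uses_per_month": 6},
}

flagged = flag_questionable(example_responses)
print(flagged)  # ['r3']
```

Respondent "r2" commits one logic error and is therefore not flagged, which mirrors the safeguard that a single inconsistency should not by itself be taken as evidence of poor response quality.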

An example should help illustrate the usefulness of the process in detecting problems with data collected. A study was conducted by one of the authors in which a number of logic checks were built into a series of questions dealing with attitudes toward receiving direct mail, readership habits with regard to direct mail, and actions taken in response to direct mail (e.g., use of coupons, shop at store, sales, etc.). Results of the study indicated that approximately six percent of the returned questionnaires contained some logic errors. Only a small proportion of these poor quality respondents had a high proportion of missing values. If it is assumed that the incidence of logic errors indicates poor quality, then attempting to "clean up" the data by setting some cutoff point for an acceptable level of missing values would not accomplish the goal. Low data quality can be a problem even when item non-response is low.

DISCUSSION AND CONCLUSION

The quality of self-report data is of central importance to consumer behavior researchers. As shown in this paper, however, the applied researcher must develop his own indices of response quality in many instances. Where scaled measures exist or can be developed, standard measures of data quality exist and should be used. In those cases where these techniques are not applicable, data quality cannot be ignored. Several checks on the quality of responses have been suggested in this paper, and researchers can doubtlessly think of many others.

The procedures suggested are stopgap measures at best. Clearly, more work is needed in this area to develop standard techniques of assessing the quality of responses to non-scaled measures. Research is needed which has as its primary focus the quality of self-report data. This research should attempt to investigate definitions and measures of data quality and their relationship to standard definitions and measures of reliability and validity. Once these measures and definitions are developed, replication of their operationalization must be carried out across topics and populations.

A second area of research should focus on methods of increasing the quality of data. Instructions, for example, may be tested for their ability to provide better data. The inclusion of a "don't know" option should be tested to determine whether it will reduce item non-response and inconsistent patterns of data. Or, special questions may be included to screen out those respondents whose opinions and answers may be unreliable. For example, respondents could be asked to indicate how certain they are of their judgments. Respondents who are extremely uncertain of their answers may be dropped from further analyses. The efficacy of this approach could be tested by examining the correlation between certainty scores and test-retest reliability scores. Finally, one may conclude that special questionnaires must be developed for different populations of respondents who demonstrate difficulty with standard questionnaire formats.

In any case, it is not enough to suggest that researchers develop scales for every measurement situation and use traditional reliability and validity techniques. There are too many instances in applied research settings where variables are not scalable, or where a score is not particularly useful to the researchers. A better tactic is to recognize the issue of data quality and to develop measures to fit the research context.

REFERENCES

Gilbert A. Churchill, Jr., Marketing Research: Methodological Foundations (Homewood, Illinois: Dryden Press, 1976).

Lee J. Cronbach, "Coefficient Alpha and the Internal Structure of Tests," Psychometrika, 16(June, 1951), 297-334.

Hubert S. Field, "Effect of Sex of Investigator on Mail Survey Response Rates and Response Bias," Journal of Applied Psychology, 60(December, 1975), 772-773.

Robert Ferber, "The Problem of Bias in Mail Returns: A Solution," Public Opinion Quarterly, 12(Winter, 1948), 669-676.

Hershey H. Friedman and Larry Goldstein, "Effect of Ethnicity of Signature on Rate of Return and Content of a Mail Questionnaire," Journal of Applied Psychology, 60(December, 1975), 770-771.

Robert A. Hansen and Carol A. Scott, "Improving the Representativeness of Survey Research: Some Issues and Unanswered Questions," Educator's Proceedings, American Marketing Association, 1977, 401-4.

Michael J. Houston and Neil M. Ford, "Broadening the Scope of Methodological Research on Mail Surveys," Journal of Marketing Research, 13(November, 1976), 397-403.

G. David Hughes, Attitude Measurement for Marketing Strategies, (Glenview, Illinois: Scott Foresman and Company, 1971).

Leslie Kanuk and Conrad Berenson, "Mail Surveys and Response Rates: A Literature Review," Journal of Marketing Research, 12(November, 1975), 440-452.

Roger A. Kerin, "Personalization Strategies, Response Rate and Response Quality in a Mail Survey," Social Science Quarterly, 55(June, 1974), 175-181.

Roger A. Kerin and Robert A. Peterson, "Personalization, Respondent Anonymity and Response Distortion in Mail Surveys," Journal of Applied Psychology, 62(November, 1977), 86-89.

G. F. Kuder and M. W. Richardson, "The Theory of the Estimation of Test Reliability," Psychometrika, 2(September, 1937), 151-160.

Arnold S. Linsky, "Stimulating Responses to Mailed Questionnaires: A Review," Public Opinion Quarterly, 39(Spring, 1975), 82-101.

William J. Lundstrom and Lawrence M. Lamont, "The Development of a Scale to Measure Consumer Discontent," Journal of Marketing Research, 13(November, 1976), 373-381.

J. C. Nunnally, Psychometric Theory, (New York: McGraw-Hill, 1967).

J. Paul Peter, "Reliability, Generalizability and Consumer Behavior," Proceedings, Association for Consumer Research, 1976, 394-400.

Milton M. Pressley, Mail Survey Response: A Critically Annotated Bibliography, (Greensboro, North Carolina: Faber and Company, 1976).

Julian Stanley, "Reliability," in Robert L. Thorndike (ed.), Educational Measurement, 2nd ed., (Washington, D.C.: American Council on Education, 1971), 356-442.

Donald S. Tull and Gerald S. Albaum, Survey Research: A Decisional Approach, (New York: Intex Educational Publishers, 1973).

Donald S. Tull and Del I. Hawkins, Marketing Research: Meaning, Measurement, and Method, (New York: Macmillan Publishing Company, 1976).

William J. Whitmore, "Mail Survey Premiums and Response Bias," Journal of Marketing Research, 13(February, 1976), 46-50.

Frederick Wiseman, "Methodological Bias in Public Opinion Surveys," Public Opinion Quarterly, 36(Spring, 1972), 105-108.
