Advances in Consumer Research Volume 18, 1991 Pages 558-565
AN EXPLORATORY INVESTIGATION OF QUESTIONNAIRE PRETESTING WITH VERBAL PROTOCOL ANALYSIS
Ruth N. Bolton, GTE Laboratories Incorporated
This paper explores how cognitive research methods can be used to identify defective questions in survey questionnaires. Concurrent verbal protocols are elicited during face-to-face interviews to investigate respondents' cognitive processes as they respond to survey questions. The protocols are segmented and then coded using categories that represent comprehension, retrieval and judgment problems. This procedure is used to pretest two versions of a telecommunications survey. The results indicate that this procedure is useful for evaluating draft questionnaires, and for identifying questions that are associated with information processing problems.
Questionnaire pretesting is a relatively straightforward, low-cost method for detecting problems with a questionnaire. It entails a small pilot study to determine how a questionnaire can be improved to minimize response errors, such as a respondent misinterpreting a question (Converse and Presser 1986). Individual items (i.e., questions) can be pretested for an acceptable level of response variation, meaning, task difficulty, and respondent interest/attention. Alternatively, the overall questionnaire can be pretested for appropriate vocabulary, question order, skip patterns, timing, and overall respondent interest, attention and well-being. These tests enable a researcher to identify and change questionnaire design features to minimize response errors.
Since non-sampling error (i.e., response and non-response error) is the major contributor to total survey error (Assael and Keon 1982), questionnaire pretests should be central to the survey design process. However, as several authors have noted (e.g., Lehmann 1979), researchers frequently neglect to pretest survey questionnaires. Furthermore, academic researchers have failed to examine methodological issues in questionnaire pretesting. In a notable exception, Hunt, Sparkman and Wilcox (1982) found that traditional methods of pretesting (i.e., a face-to-face or telephone interview followed by a debriefing) were effective in identifying some, but not all, types of problem questions.
This paper explores how cognitive research methods can be used to pretest questionnaires. It describes a new method of identifying defective questions and illustrates the method by pretesting two versions of a survey. Concurrent verbal protocols are elicited during face-to-face interviews to investigate respondents' cognitive processes as they respond to survey questions. In contrast with prior questionnaire pretests, the unit of analysis is a speech burst, rather than the complete response to a survey question. Segmenting protocols into speech bursts enables the use of a coding scheme that identifies the macro-processes associated with answering a survey question. Protocols are coded using categories that represent comprehension, retrieval and judgment problems. This procedure is useful for evaluating draft questionnaires and diagnosing problems.
The survey research literature provides guidance about developing questionnaires and multi-item measures of constructs for questionnaires (Sudman and Bradburn 1982; Churchill 1979), but there is very little guidance about pretesting questionnaires. Most researchers agree that the first pretest should be administered by personal interview to elicit concurrent or retrospective protocols, even if the questionnaire will ultimately be administered by mail or telephone. However, they are almost completely silent about how the pretest data should be collected and analyzed. Fortunately, recent interdisciplinary research in cognitive psychology and survey research provides some direction (Jabine, Straf, Tanur and Tourangeau 1984; Hippler, Schwarz and Sudman 1987).
Pretest Data Collection
Researchers have used a variety of methods of pretest data collection. For example, the National Center for Health Statistics (NCHS) has used concurrent protocols, paraphrasing, retrospective protocols, confidence ratings, and response latency measurements (Royston, Bercini, Sirken and Mingay 1986; Royston 1987). Although there is little information available about their methods, they seem to have primarily relied on retrospective protocols. In a study of the processes respondents use in answering behavioral frequency questions, Blair and Burton (1987) also elicited retrospective protocols by asking "How did you come up with that answer?" They argued that a retrospective protocol was an appropriate measure of process because: (a) a concurrent protocol would have altered "natural survey conditions," (b) previous authors have recommended retrospective protocols when they can be taken immediately and the processing episode is brief, and (c) demand effects that would lead respondents to distort the process did not seem likely.
Retrospective reports may be accurate in some contexts (Wright and Rip 1981), but there are potential problems (Ericsson and Simon 1984; Nisbett and Wilson 1977). Respondents may fail to retrieve and accurately verbalize the processes used, and they may report processes they feel they should have used rather than those they actually used. In contrast, concurrent verbalization is closely related to the actual survey task -- responding to the interview questions -- so that the request to verbalize should not interfere with or change the respondents' thought processes. For this reason, the elicitation of concurrent protocols seems particularly appropriate for identifying defective questions. However, Hunt, Sparkman and Wilcox (1982) did not find any difference between the effectiveness of concurrent and retrospective protocols. Surprisingly, they did find that verbal protocols collected during telephone interviews were more effective in detecting defective questions than verbal protocols collected during in-store (i.e., face-to-face) interviews.
Pretest Data Analysis
Recent research on the cognitive processes underlying a respondent's answer to a survey question provides a theoretical context for a new method of analyzing pretest data. Building on work concerning the structure of attitudes, Tourangeau (1987; Tourangeau and Rasinski 1988) proposed that the respondent's answer to an attitude question is the product of four stages or macroprocesses: comprehension of the question, the retrieval of relevant beliefs and feelings from memory, the weighing of information to form a judgment, and the selection of an appropriate response alternative. If the content characteristics of a respondent's verbal protocols arise from these four macroprocesses, an analysis of pretest data should be able to identify defective survey questions by coding respondents' processing difficulties.
Although content analysis has been used to code and analyze open-ended questions, it has not been applied to questionnaire pretest data (Weber 1985). In prior research that has coded verbal protocols from surveys, the unit of analysis has been the complete response to a question. Hunt, Sparkman and Wilcox (1982) coded five types of faulty questions: loaded questions, double questions, ambiguous questions, inappropriate vocabulary, and missing alternatives. An error identification was scored if the respondent made comments "which could help in recognizing the error" (p. 272). However, Tourangeau's model suggests that a pretest should trace the thought processes that occur as a respondent forms an answer to a question. Hence, in order to identify difficulties in the response process, the unit of analysis for verbal protocols elicited during questionnaire pretests should be a speech burst or segment.
A PROPOSED PRETEST METHODOLOGY
This section describes a pretest methodology that uses cognitive research methods to identify defective questions. This methodology entails the elicitation of concurrent verbal protocols -- a process tracing technique that requires the respondent to think aloud while making a decision (Ericsson and Simon 1984). The verbal protocols are segmented into speech bursts and coded to identify respondents' difficulties in forming answers to the survey questions.
The core of this pretest methodology is the coding scheme. Coding schemes in consumer research have typically identified information processing strategies, such as attribute comparison versus within-brand processes (e.g., Bettman and Park 1980). In contrast, this coding scheme identifies respondents' comprehension, retrieval, judgment and response difficulties. The conceptual framework for this coding scheme is described in the following paragraphs.
A respondent's ability to comprehend a survey question should be facilitated when the survey provides contextual information, such as bridging statements, a logical sequence of questions, groups of related questions and explicit (perhaps lengthy) questions (Tourangeau 1984). If a survey question is defective, respondents frequently indicate comprehension difficulties by asking a question. Hence, a coding category was created to count questions asked by the respondent. In the context of many inter-related survey questions, respondents frequently indicate comprehension difficulties by statements about the similarity of the questions. These statements can be identified by a category of verbal cues such as "I've already answered that question."
Surveys typically elicit memory-based rather than stimulus-based judgments, necessitating the retrieval of information or earlier judgments. In a typical survey context, the respondent retrieves information -- rather than earlier judgments -- because he/she acquired information without knowing that a judgment would be required. Consequently, a respondent's ability to retrieve information in response to a survey question depends on whether the cues provided match the cues available during encoding. The respondent's failure to retrieve information -- that is, forgetting -- can occur when the relevant information was not stored in long-term memory, it cannot be retrieved from available cues, or it is difficult to distinguish from related information (Tourangeau 1984). The respondent's difficulty in retrieving information from memory can be identified from the respondent's verbal protocols concerning the retrieval process: either information was not stored in long-term memory (e.g., "no experience") or information cannot be retrieved (e.g., "don't remember").
There are numerous studies that document the existence of judgment heuristics and response biases in decision-making contexts. In a survey context, response biases may be associated with questions for which respondents experience difficulty forming judgments. Respondents' verbal protocols may provide verbal cues that indicate they are experiencing difficulty forming a judgment (e.g., "difficult to say") or that they lack confidence in their answer to the question (e.g., "maybe"). These two categories of verbalizations should identify defective questions, but they are not completely diagnostic. The respondent may not distinguish between difficulty retrieving information versus difficulty evaluating information.
This study does not examine response difficulties because they are related to measurement and scaling issues. However, it is conceptually straightforward to code respondents' use of scale items or use of fixed response alternatives. The coding scheme does code pauses because they indicate unidentified, nonverbalized processing. For example, if many respondents pause when answering a specific question, the question may require extensive information processing so that it is inappropriate for a brief telephone interview.
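To make the coding step concrete, a key-word lookup of the kind described above can be sketched in a few lines. The category names and cue lists below are invented for illustration; the actual lists appear in Exhibit 2:

```python
# Minimal sketch of the key-word coding scheme. Cue lists are
# illustrative stand-ins for the lists in Exhibit 2 of the paper.
CATEGORIES = {
    "question":      [],  # counted from precoded interrogative inflection, not key words
    "similar":       ["already answered", "same question", "just like"],
    "no_experience": ["no experience", "never used", "haven't used"],
    "dont_remember": ["don't remember", "can't recall", "forget"],
    "uncertain":     ["maybe", "i guess", "probably"],
    "cant_judge":    ["difficult to say", "can't say", "hard to judge"],
}

def code_segment(segment: str) -> list[str]:
    """Return every category whose cue words appear in a speech burst.
    A burst may receive multiple codes, or none at all."""
    text = segment.lower()
    return [cat for cat, cues in CATEGORIES.items()
            if any(cue in text for cue in cues)]

# A subtle retrieval cue of the kind quoted later in the paper:
print(code_segment("I haven't used it in a long time"))  # -> ['no_experience']
```

Because a burst can contain cues from several categories, the coder returns a list rather than a single code, consistent with assigning multiple codes to a single segment.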
THE DATA BASE
The previous section described a questionnaire pretest methodology. This section describes how the methodology was applied in a split ballot experiment that pretested a customer satisfaction survey used by GTE Telephone Operations. GTE, like most franchised suppliers of local telephone service, regularly surveys its customers to identify potential service enhancements, to evaluate the effect of enhancements, and to meet public utilities commission requirements. In 1989, GTE conducted an experiment to compare the current version of its residential customer survey with a revised version (that was intended to replace it in 1991). The entire survey instrument was pretested, but (due to space limitations) this study only examines the pretest data from one section of the instrument in which the current version (CV) measured service quality dimensions with a sequence of "hypothetical" questions while the revised version (RV) measured the same dimensions with conventional perceptual ratings questions. The two versions of this section of the questionnaire are shown in Exhibit 1.
Residential customers were recruited by telephone to participate in face-to-face interviews lasting approximately forty-five minutes to an hour. Prior to the interview, respondents were informed of the purpose of the study and were informed that their responses would be audiotaped. As with virtually all pretests, this pretest employed a small convenience sample. Six customers were administered the revised version and fifteen customers were administered the current version (because GTE management was particularly interested in the latter's performance).
The task instructions for the elicitation of concurrent protocols during the questionnaire pretest differ somewhat from the instructions used in problem-solving situations. They asked the respondent to "constantly THINK OUT LOUD while you are deciding about your answers." The interviewers were trained to strictly follow the questionnaire, but they were permitted to backchannel. (Backchanneling is the occurrence of a speaking turn of one word (e.g., "uh-huh") that does not follow a question. This activity is the equivalent of a nod or other indication that the interviewer "hears" the respondent.) The interviewer administered two practice questions similar in format to the actual survey questions. Afterward, the introductory sentences of the survey provided the transition to the actual survey questions. The interviewer used the phrases "Remember, I'm interested in what you are thinking," and "You're doing a good job of thinking aloud" to reinforce the respondents' behavior.
Segmentation, Precoding and Coding
The audiotapes of the interviews were transcribed into electronic form and the transcripts were segmented for analysis. The respondents' speech was segmented into speech bursts or utterances using short pauses, intonation, and syntactical markers (for complete phrases, clauses or sentences) as cues. The transcripts were pre-coded by marking the text to indicate pauses (of three seconds or more) and questions (identified by an interrogative voice inflection).
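The segmentation and precoding steps can be sketched as follows. The "&lt;pause&gt;" marker and the use of slashes and periods as burst boundaries are assumptions about the transcript notation, which the paper does not specify:

```python
import re

# Sketch of segmentation and precoding, assuming the transcriber marks
# pauses of three seconds or more as "<pause>", separates bursts with
# "/" or ".", and ends questions (interrogative inflection) with "?".
def segment(transcript: str) -> list[str]:
    """Split a respondent's turn into speech bursts, keeping pause markers."""
    parts = re.split(r"(<pause>)|[./]", transcript)
    return [p.strip() for p in parts if p and p.strip()]

def precode(burst: str) -> str:
    """Tag the two precoded events: pauses and questions."""
    if burst == "<pause>":
        return "[PAUSE]"
    if burst.endswith("?"):
        return burst + " [QUESTION]"
    return burst

turn = "I haven't used it in a long time / <pause> but I would say good."
print([precode(b) for b in segment(turn)])
# -> ['I haven't used it in a long time', '[PAUSE]', 'but I would say good']
```

Question marks are deliberately not used as burst boundaries so that interrogative bursts survive segmentation and can be tagged.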
Each coding category is a list of key words or word strings that have similar meanings or a nonverbal cue (such as a pause). The initial lists were developed from a pretest of a cafeteria survey. Respondents' protocols were examined to (1) identify key words (e.g., evaluate), and (2) generate synonyms (e.g., judge, rate), including colloquialisms. The coded transcripts were reviewed to ensure that the key words were associated with appropriate segments, and the coding categories were revised several times. Revisions usually resulted in more restrictive word strings or new word strings. These steps were repeated using transcripts from the pretest of the telecommunications survey. Exhibit 2 shows the lists of key words for each category.
After the interviews were transcribed, segmented and precoded, the transcripts were coded with Miller and Chapman's (1982) computer program, Systematic Analysis of Language Transcripts (SALT). This program was not designed for the analysis of pretest data, but it can be adapted to this purpose. There are several advantages to automatic encoding (Ericsson and Simon 1984). It requires all of the underlying vocabulary and inference rules to be defined and applied consistently; its reliability is perfect; and its robustness to changes in vocabulary and rules can be tested. Since the coding scheme was not intended to completely describe the processing strategy of each respondent, a code was not assigned to every segment. Multiple codes were sometimes assigned to a single segment when it contained verbal cues for more than one category.
The SALT program was used to count the number of occurrences of each category for each question answered by each respondent. These "counts" were converted into percentages by dividing by the total number of segments uttered by the respondent (to adjust for the fact that some respondents are more verbose than others). Then, an exploratory factor analysis was conducted to investigate the interdependence among the coding categories. Afterward, the percentage of segments in each category of information processing difficulty were tabulated to provide diagnostic information about potential defects in the current and revised questionnaires.
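The verbosity adjustment described above amounts to a simple normalization of the category counts; the respondent figures below are invented for illustration:

```python
# Sketch of the verbosity adjustment: per-respondent category counts are
# divided by that respondent's total number of segments. Counts are invented.
def category_proportions(counts: dict[str, int],
                         total_segments: int) -> dict[str, float]:
    """Convert raw category counts into percentages of all segments uttered."""
    return {cat: 100.0 * n / total_segments for cat, n in counts.items()}

# A verbose and a terse respondent with the same raw pause count
# receive different adjusted scores:
verbose = category_proportions({"pause": 2, "dont_remember": 1}, total_segments=40)
terse = category_proportions({"pause": 2, "dont_remember": 1}, total_segments=10)
print(verbose["pause"], terse["pause"])  # -> 5.0 20.0
```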
Since the development of the coding categories relied heavily on the face validity of the lists of verbal cues, our analysis began by examining the convergent and discriminant validity of the seven coding categories. The proportions of segments in each coding category for each question were subjected to a principal components analysis with a varimax rotation. The results are displayed in Table 1. Three factors with eigenvalues greater than one explain 55% of the variance in the original seven categories. Appropriate labels for these three factors seem to be COMPREHENSION, RETRIEVAL and JUDGMENT. The comprehension factor has heavy loadings on the categories for questions and "similar" statements; the judgment factor has heavy loadings on the categories measuring uncertainty and "can't judge," and the retrieval factor has a heavy loading on the category that measures "don't remember." Pauses and segments about "no experience" tend to load on more than one factor, indicating that they reflect multiple information processing problems. This factor analytic structure supports the notion that respondents' verbal protocols reflect three independent underlying information processes.
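The principal components analysis with varimax rotation can be sketched with a standard eigendecomposition of the category correlation matrix. The data below are synthetic (three latent processes driving seven categories, invented for illustration); only the procedure (correlate, eigendecompose, retain eigenvalues above one, rotate) mirrors the analysis in the text:

```python
import numpy as np

def varimax(loadings: np.ndarray, n_iter: int = 100, tol: float = 1e-6) -> np.ndarray:
    """Standard varimax rotation of a factor-loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    prev = 0.0
    for _ in range(n_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p))
        rotation = u @ vt
        if s.sum() < prev * (1.0 + tol):
            break
        prev = s.sum()
    return loadings @ rotation

# Synthetic proportions for 200 respondents on 7 coding categories,
# built from three latent "processes" plus noise.
rng = np.random.default_rng(0)
f = rng.normal(size=(200, 3))
data = np.column_stack([f[:, 0], f[:, 0], f[:, 1], f[:, 1],
                        f[:, 2], f[:, 2], f[:, 2]])
data = data + 0.3 * rng.normal(size=(200, 7))

corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
keep = eigvals > 1.0                     # Kaiser criterion, as in the paper
loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
rotated = varimax(loadings)
print(f"{keep.sum()} components retained")
```

The rotation is orthogonal, so communalities are preserved; only the allocation of loadings across factors changes, which is what makes the COMPREHENSION/RETRIEVAL/JUDGMENT labels easier to read off.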
Tabulation of Coding Categories
The percentage of segments in each category of information processing difficulty are shown in Table 2. These percentages are intrinsically interesting because they are measures of the extent of information processing problems. For example, the average respondent uttered 11.4 segments in response to the CV of question two. Of these segments, 7.9% were pauses, so that the average customer must have paused about once (i.e., 11.4 segments x 7.9% = 0.90 segments) during his response. This statistic demonstrates that most customers engaged in nonverbalized processing while responding about the service representative -- implying that the question did not elicit a "top of mind" response. The following paragraphs examine the pattern of the results in Table 2 to determine whether the coding categories are providing diagnostic information about defective questions. Although statistical tests of differences in proportions between the two versions of each question are reported, the results of these tests should be viewed with caution due to the small sample size.
Question One - Repair. The RV asks three separate questions about repair, whereas the CV asks a single question. Respondents generate substantially more thoughts and they make significantly (p < 0.15) more statements about question similarity in response to the RV questions. Thus, respondents are supplying more information, but they perceive redundancy among the three repair questions in the RV. Furthermore, respondents are slightly less likely to verbalize retrieval difficulties ("no experience") and slightly more likely to verbalize judgment difficulties ("uncertain") in the RV.
Question Two - Service Representative. The second question asks customers to rate the service representative's handling of their requests. The CV provides examples of situations in which the customer might have contacted a service representative about a service order whereas the RV does not. (From the telephone company's viewpoint, service orders include requests for additional directory listing(s), requests for new services such as call waiting, and the installation of additional lines or equipment, as well as changes in service (e.g., flat rate versus measured unit service). Customers typically place these orders by telephoning a company service center and speaking to a representative.) Respondents' verbal protocols indicate proportionally more retrieval problems ("don't remember," p < 0.15) and judgment problems ("can't say," p < 0.15) in response to the RV. This result suggests that customers have difficulty evaluating the service representative's handling of their request because the RV does not supply appropriate cues for customers to retrieve this information from memory.
Question Three - Service Change. The RV's failure to provide cues that match customers' encoding of information about service order requests is compounded in question three, in which the customer is asked to rate how service changes are handled. Respondents' verbalizations indicate their inability to retrieve information ("no experience," p < 0.15) in response to the RV. Apparently, customers' recollections of service order contacts are less vivid than their recollections of repair contacts.
Question Four - Directory Assistance. The RV provides a less detailed description of directory assistance service than the CV. Respondents do not seem to experience more comprehension or retrieval difficulties with the RV, but they generate proportionally more segments about their inability to judge (p < 0.15). One explanation for this result is that respondents confuse toll operators (that handle requests for connections to long distance numbers) with directory assistance operators (that handle requests for information about local numbers). They are probably unable to retrieve information to form a judgment because their contacts with operators are rare, brief, low-involvement experiences.
All Questions. Excepting question one, respondents ask fewer questions and make fewer comments about similarity of questions in response to the RV of the questionnaire. This finding suggests that they typically experience fewer comprehension difficulties with the RV. Respondents produce significantly more (p < 0.15) segments indicating that they "don't remember" across all four RV questions. This result is due to fewer retrieval difficulties with questions one and four and more retrieval difficulties with questions two and three. Respondents make more statements about judgment difficulties ("can't judge," p < 0.15) with the RV. Their difficulties arise in questions two and four, in which respondents are provided with few retrieval cues and asked to judge brief contacts with telephone company personnel.
The diagnostic information provided by our content analysis of verbal protocols cannot be obtained from conventional pretest methodologies. Only a coding scheme that focuses on respondents' cognitive processes as they respond to a survey question can determine whether the question poses a processing difficulty. For example, Table 2 shows that customers' responses to both versions of the directory assistance question include relatively frequent (4.8% and 3.3%) mentions of "no experience." This statistic is based on a detailed count of the number of times the relevant key words/strings listed in Exhibit 2 occur in customers' speech bursts. The verbal cues are quite subtle. For example, one customer responded to the CV of this question by saying, "I haven't used it in a long time / but I would say good." Coding schemes that focus on the entire response (e.g., Hunt, Sparkman and Wilcox 1982) are very useful for some types of problem questions, but they do not provide this type of information.
Inspection of respondents' protocols suggests that additional coding categories could be helpful in revising defective questions. For example, the CV seems to create comprehension difficulties for respondents because the questions are framed in a "hypothetical" way and the phrasing "how would you evaluate" is less comprehensible than the more colloquial phrase "how would you rate." For example, one customer responded to the CV of question two by saying, "In other words, you're asking me to do a pre-evaluation of what I might do in the future." Another coding category could be created that counts occurrences of such interpretive comments, perhaps including the word string "you're asking ..."
The information from these analyses can be used (and was used) to create a new version of the questionnaire that improves upon both pretested versions. For example, the pretest results suggest that the RV of the service representative question is defective because it does not provide appropriate cues for service order requests. An introductory statement could be added to provide this information.
This study described a new questionnaire pretesting methodology. The content characteristics of verbal protocols elicited in a pretest were found to reflect comprehension, retrieval and judgment processes. Coding categories representing these macroprocesses provided diagnostic information about defective questions in a split ballot pretest of a telecommunications questionnaire. The methodology yielded useful recommendations for further questionnaire revisions.
In this study, respondents experienced comprehension difficulties due to the phrase "How would you evaluate" and retrieval difficulties due to inappropriate cues/phrases in questions about telephone service changes and directory assistance. Hence, the questionnaire pretesting methodology seems better suited to identifying question-phrasing problems than question-sequencing problems. Repeated usage of this methodology may yield some rules for "good" question phrasing. Eventually, standards may evolve that indicate "acceptable" levels of response difficulties.
This methodology complements, rather than replaces, existing methods because it cannot identify all types of defective questions. Content analysis depends heavily on the specification and measurement of theoretically-justified content characteristics. Further research is necessary to improve our understanding of the cognitive processes underlying a respondent's answer to a survey question and to generalize the measurement/coding scheme to other survey contexts.
This approach is time consuming and labor intensive. In some instances, sufficient information for questionnaire revisions can be obtained from coding schemes that are used in observational monitoring (Bercini 1989) or coding schemes that focus on questionnaire design errors (Hunt, Sparkman and Wilcox 1982). However, a content analysis of verbal protocols is warranted in pretests of large-scale government and industry surveys that generate data for policy decisions. For example, GTE's survey programs provide vital -- and expensive -- inputs to corporate decisions. Consequently, the benefits of pretesting GTE's surveys outweighed the associated costs -- which constituted less than 5% of total survey program costs.
Assael, Henry and John Keon (1982), "Nonsampling vs. Sampling Errors in Survey Research," Journal of Marketing, 45 (Spring), 114-123.
Bercini, Deborah (1989), "Observation and Monitoring of Interviews," Quirk's Marketing Research Review (May).
Bettman, James R. and C. W. Park (1980), "Implications of a Constructive View of Choice for Coding Protocol Data: A Coding Scheme for Elements of Choice Processes," in Advances in Consumer Research, 7, San Francisco: Association for Consumer Research: 148-153.
Blair, Edward and Scot Burton (1987), "Cognitive Processes Used by Survey Respondents to Answer Behavioral Frequency Questions," Journal of Consumer Research, 14 (September), 280-288.
Churchill, Jr., Gilbert A. (1979), "A Paradigm for Developing Better Measures of Marketing Constructs," Journal of Marketing Research, 16 (February), 64-73.
Converse, Jean M. and Stanley Presser (1986), "Survey Questions: Handcrafting the Standardized Questionnaire," Sage University Paper series on Quantitative Applications in the Social Sciences, 07-063, Beverly Hills: Sage Publications.
Ericsson, K. Anders and Herbert A. Simon (1984), Protocol Analysis: Verbal Reports as Data, Cambridge, MA: MIT Press.
Hippler, Hans-J., Norbert Schwarz and Seymour Sudman (1987), Social Information Processing and Survey Methodology, New York: Springer Verlag.
Hunt, Shelby D., Richard D. Sparkman, Jr., and James B. Wilcox (1982), "The Pretest in Survey Research: Issues and Preliminary Findings," Journal of Marketing Research, 19 (May), 269-73.
Jabine, Thomas B., Miron L. Straf, Judith M. Tanur, and Roger Tourangeau (1984), Cognitive Aspects of Survey Methodology, Washington: National Academy Press.
Lehmann, Donald R. (1979), Market Research and Analysis, Homewood, IL: Richard D. Irwin, Inc.
Miller, J. and Chapman, R. (1982), "Systematic Analysis of Language Transcripts (SALT)," Unpublished manuscript, University of Wisconsin.
Nisbett, Richard and Timothy Wilson (1977), "Telling More Than We Know: Verbal Reports on Mental Processes," Psychological Review, 84 (May), 231-59.
Royston, Patricia (1987), "Application of Cognitive Research Methods to Questionnaire Design," Paper Presented at the Society for Epidemiological Research Twentieth Annual Meeting.
Royston, Patricia, Deborah Bercini, Monroe Sirken and David Mingay (1986), "Questionnaire Design Research Laboratory," Paper Presented at the Meetings of the American Statistical Association.
Sudman, S. and N. Bradburn (1982), Asking Questions: A Practical Guide to Questionnaire Design, San Francisco: Jossey-Bass.
Tourangeau, Roger (1987), "Attitude Measurement: A Cognitive Perspective," in H. Hippler, N. Schwarz, and S. Sudman (Eds.), Social Information Processing and Survey Methodology, New York: Springer-Verlag, 149-162.
Tourangeau, Roger and Kenneth A. Rasinski (1988), "Cognitive Processes Underlying Context Effects in Attitude Measurement," Psychological Bulletin, 103 (3), 299-314.
Weber, Robert Philip (1985), Basic Content Analysis, Sage University Series on Quantitative Applications in the Social Sciences, 07-049, Beverly Hills and London: Sage Publications.
Wright, Peter and Peter D. Rip (1981), "Retrospective Reports on the Causes of Decisions," Journal of Personality and Social Psychology, 40 (4), 601-14.