Validating Realtime Response Measures

Thomas C. Boyd, University of North Carolina, Chapel Hill
G. David Hughes, University of North Carolina, Chapel Hill
Thomas C. Boyd and G. David Hughes (1992), "Validating Realtime Response Measures," in NA - Advances in Consumer Research Volume 19, eds. John F. Sherry, Jr. and Brian Sternthal, Provo, UT: Association for Consumer Research, Pages: 649-656.

Advances in Consumer Research Volume 19, 1992, Pages 649-656

[The authors gratefully acknowledge the assistance of MSI in providing support for subjects and two anonymous reviewers for their helpful comments and suggestions.]

Realtime measures provide a rich source of information regarding consumer attitudes, feelings, and judgements about stimuli. Television advertisements are particularly well suited to this form of measurement. Although popular with practitioners, realtime measures have received little attention from researchers, who still face important questions about their reliability and validity. The authors report the results of a validation study using the Multitrait-Multimethod matrix first proposed by Campbell and Fiske (1959). They find evidence of construct validity and make suggestions that will facilitate future assessment of the validity of realtime measures.

Consumer researchers have traditionally had to deal with limitations when collecting consumer responses to audio/visual stimuli. Most methods of data collection require either an interruption in the stimulus to elicit a response or a response after exposure. When the subject is responding to a stimulus such as a TV ad, interruptions can be intrusive, and post-exposure responses cannot capture how reactions change over the period of the exposure, which makes process tracing difficult. One way to overcome these problems is to use data collection methods that allow subjects to respond continuously over the course of an exposure. One such method is realtime response measurement.

Realtime response measurement has become the generic term for techniques that allow subjects to respond whenever they wish to continuous stimuli such as television commercials, speeches, sales presentations, lectures, presentations to mock juries, and television programs. Systems used recently by commercial researchers required subjects to push buttons, turn dials, or move sliders (Porado 1989). The methods can be traced to the 1930s when George Gallup used his method to help edit Gone With the Wind and Frank Stanton and Paul Lazarsfeld developed a system to test CBS programs. The equipment was bulky and analysis of the output was limited. The few studies reported in the marketing literature have been consulting reports (e.g., Polsfus and Hess 1989). Microcomputer and video technology in the last decade made the equipment portable and extended the methods of analysis, providing new opportunities for academic research.

Figure 1 shows an example of the output that can be generated from realtime measures. It shows results from a control group that watched eleven thirty-second commercials and a weather report. Subjects watched the entire set twice, first dialing for affect (favorability of feelings) and then dialing for cognition (usefulness of information), each time using a continuous rotary dial with a range of 1-100, where 50 is considered 'neutral'. As can be seen from the figure, subjects sometimes showed no similarity between judgements of usefulness and favorability, as in the first ad for Budweiser, and sometimes showed a high correspondence between the two, as in the second ad for Upjohn.

Realtime measures can provide a rich source of information regarding consumer attitudes, feelings, and judgements about stimuli. Television advertisements are particularly well suited to this form of measurement. Although popular with practitioners, realtime measures have not been subjected to tests for reliability, validity, and replication of findings using other measures. An exception is Hughes and Lennox (1990), who reported high week-after retest reliability. Our purpose here is to explore these issues using the Multitrait-Multimethod (MTMM) matrix first suggested by Campbell and Fiske (1959). But first, we give an example of how realtime measurement methods can contribute to an existing area of research.

AN APPLICATION

Brand Attitude and Ad Attitude Relationships

One area where realtime methods can contribute to consumer research is the study of the mediating role of the attitude toward an ad (Aad) on the attitude toward a brand (Ab) (e.g., Burke and Edell 1989; MacKenzie and Lutz 1989; MacKenzie, Lutz and Belch 1986). Aad is usually a single measure or scale administered after exposure to the ad. Such measures are subject to errors in recall and tell us nothing about the processing of the ad that produced the final attitude. While some researchers are beginning to note the need to measure processing directly (e.g., Heath 1990), consumer research tends to rely on analysis of variance or path analysis to infer process.

If we measure each subject's processing directly we can examine consumer decision processes more simply. We can then examine how prior brand attitudes affect ad processing and how ad processing revises brand attitudes. This results in a modification of the traditional model which might look like this:

$$A_{b,\,t_1} \rightarrow \text{Processing Ad}_{t_1} \rightarrow A'_{b,\,t_1+}$$

where processing Ad has replaced the traditional Aad as the mediating factor.

FIGURE 1

SAMPLE OF REALTIME CURVES SHOWING AFFECT AND COGNITION

Realtime measures can help us capture data that reflects the processing of the ad. Instead of a single post exposure measurement of, say, how the ad made subjects feel, we can have up to 150 measures taken at equal intervals over the course of a thirty second commercial. Because consumers will see an ad many times, we must expand the traditional model for additional exposures.

$$A_{b,\,t_1} \rightarrow \text{Processing Ad}_{t_1} \rightarrow A'_{b,\,t_1+} \rightarrow \text{Processing Ad}_{t_2} \rightarrow A'_{b,\,t_2+}$$

This approach suggests not only that the processing of the ad is important, but also that processing of ads may change over the course of multiple exposures. Understanding how processing changes may provide valuable insights into how and why brand attitudes change over the course of an ad campaign. Further development of this model is beyond the scope of this paper; however, it illustrates one application of the realtime method and shows why we will benefit from its use, provided validity can be demonstrated.

Construct Validation

If realtime measures are to prove helpful in academic research, we must have confidence in their value. Our focus here is on construct validity, specifically the measurement requirement that measures of a construct reflect levels of, and variations in, that construct only. Peter (1981) described two ways of viewing construct validity. The first is whether the construct is appropriate and helpful for examining the concepts which underlie the behavior of interest. In this case we must consider whether the constructs of interest in our study, affect and cognition, are appropriate for studying subjects' reactions to advertisements. There is a significant body of work that uses affect and cognition as key constructs in the study of attitudes and attitude change (e.g., MacKenzie, Lutz and Belch 1986; Petty and Cacioppo 1981), while others have focused on related concepts such as the hedonic and utilitarian components of attitude (Olney, Holbrook and Batra 1991). These works have established evidence that the constructs of interest here are valid and appropriate to the field of study.

The second approach to construct validity mentioned by Peter (1981) asks whether the measures being used are actually measuring the constructs of interest. We focus on this definition of construct validity throughout the remainder of this paper. We are concerned with demonstrating whether realtime response measures of affect and cognition are actually providing us with measures of the constructs of interest. Peter points out that this task is difficult because we cannot directly assess this form of construct validity; rather, we can only infer it from evidence provided by testing our measures.

FIGURE 2

THE MULTITRAIT-MULTIMETHOD MATRIX

Churchill (1979) details eight steps for the improvement of marketing measures. Included are the assessment of reliability and the assessment of validity of measures. For these purposes he recommends the use of multiple measures of constructs in order to perform tests of reliability and validity. Churchill recommends against using test-retest measures of reliability; however, the continuous responses from realtime measures are not likely to suffer from the same effects as scale items with discrete response choices. For a discussion of these issues see Hughes and Lennox (1990), who demonstrated reliability using test-retest realtime measures. Our focus here is on validation using multiple measures.

Campbell and Fiske (1959) provide guidelines for the examination of issues of validity. Their description of the Multitrait-Multimethod (MTMM) matrix and its application to the validation of constructs has been widely used and is still central to validation studies today (e.g., Bagozzi and Yi 1991). In this case, the MTMM matrix examines reliability and two dimensions of construct validity, discriminant validity and convergent validity.

Campbell and Fiske advocate measuring multiple concepts using multiple methods and computing the correlation matrix between the different concepts (traits) as measured by each method. Three components of construct validity are reflected in these correlations. They are the trait component, reflecting the interdependence between two traits; the method component, reflecting the interdependence between two methods; and error. In examining validity we focus on the convergence of 'maximally independent' methods and therefore hope for evidence of a low method component. Ideally, we should find evidence that the correlation between measures of a single trait using two methods is due to the trait, not to covariation in responses that the two methods share.

Traditional methods of validation use single values for comparison with other measures. For example, if we wish to examine the validity of a Likert scale item we might compare it to other Likert scale items intended to measure the same construct, or compare it to a semantic differential scale item. Each of these methods, including the MTMM matrix, uses a measure of the construct taken at some point in time. For dynamic data, such as that provided by realtime measures, the problem is one of adapting the realtime measure to a static measure for comparison. This issue is addressed in the methods section.

The Multitrait-Multimethod Matrix

The MTMM matrix is based on the comparison of multiple measurement methods measuring a set of constructs. Figure 2 shows an example of an MTMM matrix. The example shows a comparison of two methods, each measuring three constructs. Entries in the matrix are the correlations between the measures of the constructs using the method indicated for each quadrant. For example, the value in the upper left hand corner of the top region labeled B is the correlation between construct 1 as measured by method 1 and construct 2 as measured by method 1.
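The layout of such a matrix can be sketched in a few lines of Python. This is an illustration only, not the software used in the study: the function names (`pearson`, `classify`, `mtmm_entries`) and the six-subject scores are hypothetical, and the region letters follow Figure 2. Region A (the reliability diagonal) requires a retest and is not computed here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classify(a, b):
    """Assign a pair of (method, trait) measures to a region of Figure 2."""
    (method_a, trait_a), (method_b, trait_b) = a, b
    if trait_a == trait_b and method_a != method_b:
        return "C"  # validity diagonal: same trait, different methods
    if method_a == method_b:
        return "B"  # different traits, same method
    return "D"      # different traits, different methods


def mtmm_entries(scores):
    """Map each pair of measures to (region, correlation)."""
    return {
        (a, b): (classify(a, b), pearson(scores[a], scores[b]))
        for a, b in combinations(scores, 2)
    }


# Hypothetical per-subject scores: 5-point Likert items and curve averages.
scores = {
    ("likert", "affect"): [4, 2, 5, 3, 1, 4],
    ("likert", "cognition"): [3, 2, 4, 4, 2, 5],
    ("realtime", "affect"): [68, 41, 80, 55, 30, 62],
    ("realtime", "cognition"): [52, 45, 60, 66, 40, 71],
}

mtmm = mtmm_entries(scores)
for (a, b), (region, r) in mtmm.items():
    print(region, a, b, round(r, 2))
```

With two methods and two traits the sketch yields six correlations: two in region C, two in region B, and two in region D.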

The matrix provides three primary forms of validation: reliability, convergent validity and discriminant validity. The diagonals in the upper left and lower right quadrants, labeled A, represent measures of test-retest reliability. These test-retest correlations should be examined within and across groups (Churchill 1979). The test-retest results are not of primary interest in this study, although they are reported in the results section for completeness.

Convergent validity occurs when two or more different measurement methods produce similar results for the same construct. Region C in Figure 2 shows what Campbell and Fiske (1959) call the validity diagonal. Correlations in this region compare different methods of measuring the same construct and should be significantly different from zero.

The MTMM matrix provides three criteria for the determination of discriminant validity, the ability of a method to reflect the variation in the construct being measured and not other constructs that are separate and distinguishable from it. The upper and lower triangles, labeled B and D in Figure 2, are compared with the values in the validity diagonals to evaluate discriminant validity. Discriminant validity is interpreted using the following comparisons:

1. Values in the validity diagonal (C) should be larger than the values that appear in the regions labeled D. This is the most basic requirement in that it simply means that different measures of the same variable should have a higher correlation than correlations that have neither trait nor method in common.

2. Values in the validity diagonal (C) should be larger than the values that appear in the regions labeled B. These regions represent correlations between measures of a single method taken on different traits, which should be smaller than the correlations between different measures of the same trait.

3. The pattern of correlations in the heterotrait triangles, labeled B and D in the Figure, should be similar in each triangle. This criterion requires that at least three traits (constructs) be measured, so it cannot be applied to our work here.
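The first two comparisons can be expressed as a simple tally. The sketch below is a minimal illustration under stated assumptions: the function name `count_discriminant_comparisons` is hypothetical, and the correlations are invented for the example rather than taken from any table in this paper.

```python
def count_discriminant_comparisons(validity, monomethod, heteromethod):
    """Count comparisons in the desired direction for criteria 1 and 2.

    validity     -- region C values (same trait, different methods)
    monomethod   -- region B values (different traits, same method)
    heteromethod -- region D values (different traits, different methods)
    Returns {criterion: (comparisons in desired direction, total comparisons)}.
    """
    # Criterion 1: each validity value should exceed each region D value.
    c1 = [v > d for v in validity for d in heteromethod]
    # Criterion 2: each validity value should exceed each region B value.
    c2 = [v > b for v in validity for b in monomethod]
    return {
        "criterion_1": (sum(c1), len(c1)),
        "criterion_2": (sum(c2), len(c2)),
    }


# Hypothetical correlations for a single ad (two traits, two methods).
result = count_discriminant_comparisons(
    validity=[0.55, 0.60],
    monomethod=[0.70, 0.30],
    heteromethod=[0.35, 0.40],
)
print(result)  # {'criterion_1': (4, 4), 'criterion_2': (2, 4)}
```

With two traits and two methods, each criterion yields four comparisons per matrix. In this invented example the second criterion fails twice because one same-method correlation (.70) exceeds both validity values.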

One of the weaknesses of many validation methods is that they provide no decision rules for determining when validity is demonstrated (Bagozzi and Yi 1991). Rather, patterns of correspondence are interpreted, and may provide justification for further examination of the methods being evaluated. An alternative method for evaluation is structural equation modeling; however, it cannot be used with only two methods and two traits. In a future study we plan to add a third method and/or trait and compare the results from the MTMM and a structural model.

We examine the validity of realtime measures by interpreting the patterns of correlations in the MTMM matrix. We measure two traits, affect and cognition, using two methods, realtime measurement and Likert scales. We are looking for patterns of correlations, rather than perfect correlations, because we don't believe that a realtime curve measures exactly the same dimensions of the constructs as the Likert scale. The MTMM matrix is well suited to this situation because it focuses on the correspondence in patterns of measures rather than absolute correlations.

RESEARCH METHODS

One hundred and six paid subjects were recruited from a large university in the south. Four subjects were dropped when they failed to appear for the second session of data collection. While student subjects would not normally be used in tests of advertisements, it is appropriate in this case since our interest is the method of measurement. Subjects were randomly assigned to two test groups that saw the same ads. Each group saw the series of ads four times, twice each in two sessions that were one week apart.

The one-hour procedure included an instructional videotape, so all subjects received the same instructions. The videotaped instructions helped to ensure that all subjects used the same standards when dialing and used fifty as their neutral point throughout an ad. Subjects then responded to demographic and attitude questions that appeared on the television screen by pushing buttons on the keypad in front of them. Next they dialed continuously as they watched the stimuli.

The test stimuli reported here consisted of four television commercials embedded in a series of eleven thirty-second television ads and a weather forecast. Some commercials came from cooperating companies while others were taken off the air. Subjects saw all eleven commercials and the weather forecast, for a total time of ten minutes. The test ads were shown in the third, seventh, eighth and ninth positions and were chosen from among three sets of ads that represent low, medium and high levels of involvement for subjects. The high involvement ads were for a pizza chain, the moderate involvement ad was for a local bank and the low involvement ad was for homeowner's insurance. Involvement was considered a potential alternative explanation for the variance in correlations. An F test showed that involvement did not have a significant main effect (F=.00) on the correlations in the validity diagonals, so it is not considered in further analyses.

The system used for this research is a fourth generation system that has both a 1-to-100 dial for continuous stimuli and a keypad for discrete responses to categorical questions. The first time subjects saw the stream of commercials they dialed for affect on the 100 point scale. Subjects were instructed to start their dial at fifty and dial 'up' when their feelings about the ad were favorable and dial 'down' when their feelings about the ad were unfavorable. The second time they dialed for cognition, evaluating the uselessness-usefulness of the information in the ads. After seeing the stimuli, subjects used their keypads again to answer questions that included the Likert scale criterion measures. The commercials were then played back to them for discussion and debriefing. The same procedure was followed one week later, using the same commercials in the same order, to minimize alternative sources of variation in the test-retest measures.

There are many ways to reduce the realtime curve to a single point so that it can be compared to Likert measures. We want to choose a method of reduction that gives a measure providing the same information as the criterion measures. The criterion measures used for validation are retrospective assessments using Likert scale items measuring overall affect and cognition. The scale items were as follows. For affect, subjects responded to the question "How did the ad make you feel?" on a five point scale ranging from very unfavorable to very favorable. For cognition, subjects answered the question "How useful was the information in the ad?" on a five point scale ranging from very useless to very useful. We therefore hypothesize that the Likert items are most similar to an overall evaluation of the ad, represented by the average of the realtime curve values. If this is true, the average of the realtime values should be more highly correlated with the Likert items than any single point from the curve, which represents a response at only one particular moment. We should stress that there are other possible summarizations of the realtime curve values, and more examination is required before we can be confident which summary statistic best corresponds to a post-exposure scale item.

To test our hypothesis we computed correlations between the Likert item for affect and the four quartile points (i.e., the values at 7.5, 15, 22.5 and 30 seconds) on the affect curve, as well as the average value for the entire curve. The same process was repeated for cognition. As expected, the average of the curve was consistently more highly correlated with the criterion measure than the single values. This provides evidence that the most appropriate value, of those considered for use in the MTMM matrix, is the curve average. The problem here is that our criterion says the best single summary measure should be the one which correlates most highly with our criterion measures. But this also assures us of the best possible results, namely the highest possible correlations in the MTMM. For this reason it is important to stress that this work must be extended with other measures and methods before results can be considered conclusive.
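The reduction step described above can be sketched as follows. This is a hedged illustration, not the analysis code used in the study: the function names are hypothetical, the curves are invented eight-sample series rather than 150-sample recordings, and each quartile point is taken as the last sample at or before the quarter mark, which is an assumption about how the single values were read off the curve.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def reduce_curve(curve):
    """Reduce one subject's realtime curve to candidate single values."""
    n = len(curve)
    summary = {"mean": sum(curve) / n}
    for q in (1, 2, 3, 4):
        # Assumed rule: last sample at or before each quarter of the ad.
        summary[f"q{q}"] = curve[n * q // 4 - 1]
    return summary


def correlate_summaries(curves, likert):
    """Correlate each candidate summary with the Likert criterion measure."""
    reduced = [reduce_curve(c) for c in curves]
    return {
        name: pearson([r[name] for r in reduced], likert)
        for name in reduced[0]
    }


# Hypothetical affect curves (1-100 dial) and 5-point Likert affect ratings.
curves = [
    [50, 55, 60, 62, 65, 70, 72, 75],
    [50, 48, 45, 40, 42, 38, 35, 30],
    [50, 52, 55, 50, 58, 60, 57, 62],
    [50, 50, 47, 52, 49, 45, 50, 44],
    [50, 53, 51, 56, 54, 52, 58, 55],
]
likert = [5, 1, 4, 2, 3]

correlations = correlate_summaries(curves, likert)
for name, r in correlations.items():
    print(name, round(r, 2))
```

The same function could be applied with other summaries (for example the peak or the final value) by adding entries to the dictionary returned by `reduce_curve`.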

RESULTS

MTMM matrices for the four test commercials were computed for subjects from week one and week two. Due to coding problems the insurance ad from week one had to be dropped. The seven matrices appear in Tables 1 and 2. They consist of correlations between the two measurement methods, Likert scale and realtime, and the two constructs measured, general affect for the ad and a rating of the usefulness of the information in the ad, or cognition. The first MTMM matrix in Table 1 has been partitioned to correspond to the regions shown in Figure 2 to aid interpretation.

In the test of convergent validity there are seven validity diagonals, one for each ad, each with two values, for a total of 14 possible correlations between measures of a construct under the two methods. Eleven of 14 correlations are significant at the .05 level. As can be seen in the Tables, the values ranged from .16 to .73 with an average validity diagonal value of .50. We interpret this as moderately strong evidence of convergent validity and as strong support for the further examination of the realtime method.

The first criterion for discriminant validity involves a comparison of the values in the validity diagonals with the row and column elements in the hetero-method matrix (the values in regions labeled D in Figure 2). Of the twenty-eight possible comparisons, twenty-five were in the correct direction; that is, the validity diagonal elements were higher than the off-diagonal elements in the same matrix. In some cases, however, such as Pizza #3 in week 2, the values were very close.

The second criterion for discriminant validity requires comparison of the validity diagonal values with the relevant hetero-trait measures. For example, in the first MTMM matrix in Table 1 this means that the .16 and .65 values in the validity diagonal would each be compared with the .24 in column one and the .78 in column three. Under this criterion, seventeen of the twenty-eight relationships are in the desired direction. Of the eleven relationships that are not in the right direction, nine are the result of very high correlations between the Likert items for affect and cognition. In six out of seven ads the correlation between the Likert measures of affect and cognition was higher than the correlation between the realtime measures (e.g., .78 versus .24 in the Pizza #3 ad, Table 1). This indicates a higher covariation between measures of different traits using the Likert method than with realtime. There are several possible reasons for this. First, subjects may be less able to separate affective and cognitive reactions to an ad after exposure, seeking instead a consistent reaction that combines both. Second, the Likert scale may suffer from an inability to separate affective from cognitive responses. Third, it may reflect demand artifacts or the desire to appear consistent on the Likert items. Finally, it may be the result of the limited number of responses available with a Likert scale. In contrast, realtime measures have previously been demonstrated to discriminate between affect and cognition (Hughes and Lennox 1990).

The third test for discriminant validity was not performed because only two constructs were measured, affect and cognition. In the future we plan to add a third construct, purchase intentions. This addition will allow us to use the third method for evaluating discriminant validity.

Test-retest measures of reliability are included in the Tables for information. The average test-retest correlation is .53. This low number is caused in part by the low scores for the realtime measures of affect, average .25. Without these, the average correlation is .64. There are several possible reasons for the low scores: First, they represent a first and second exposure to the stimuli, and as such are not representative of the same test. Second, test-retest of realtime measures will be more accurately reflected in a point by point matching of subject responses, rather than a comparison of averages. The point by point comparison was used by Hughes and Lennox (1990). Finally, wearout may be a factor.
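The difference between the two ways of computing test-retest agreement can be illustrated with a short sketch. The procedure used by Hughes and Lennox (1990) is not described here, so the per-subject, sample-by-sample correlation below is only an assumption about what a point by point comparison might look like. The function names and the three-subject curves are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_retest(week1, week2):
    """Across-subject correlation between the two sessions' curve averages."""
    means1 = [sum(c) / len(c) for c in week1]
    means2 = [sum(c) / len(c) for c in week2]
    return pearson(means1, means2)


def pointwise_retest(week1, week2):
    """Per-subject correlation of the two curves, matched sample by sample."""
    return [pearson(a, b) for a, b in zip(week1, week2)]


# Hypothetical curves for three subjects in two sessions one week apart.
week1 = [[50, 60, 70, 65, 55], [50, 45, 40, 48, 52], [50, 55, 52, 58, 60]]
week2 = [[50, 58, 66, 63, 54], [50, 47, 44, 49, 51], [50, 51, 56, 54, 59]]

print(round(average_retest(week1, week2), 2))
print([round(r, 2) for r in pointwise_retest(week1, week2)])
```

The comparison of averages produces one coefficient per measure, whereas the point by point version produces one coefficient per subject and uses the shape of the whole curve.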

TABLE 1

FIRST WEEK MULTI-TRAIT MULTI-METHOD MATRICES

DISCUSSION

The results provide evidence that realtime response measures display both convergent and discriminant validity. These findings are strongly consistent with the recommendations of the MTMM model and give direction for future research. Researchers using realtime methods of measurement should consider the following recommendations to allow additional validation studies to be done. First, multiple scale items should be used for the criterion measures as a comparison with the realtime measures. This is important for two reasons: 1) Multiple items with demonstrated reliability will provide a better measure of the underlying construct. 2) If we believe that the realtime response captures many aspects of the construct being measured, then we would expect that multiple scale items should more accurately reflect what we believe the realtime method to be measuring. Second, future designs should be balanced to account for possible order effects. This would apply to the order of measures of affect and cognition as well as the order of the ads seen. Third, at a minimum, we should add a third method or construct to the MTMM matrix. This would provide additional validation criteria and also permit the use of Campbell and Fiske's (1959) third test for discriminant validity. Finally, we should also explore other means of validation, including the use of verbal protocols and trained judges or raters to evaluate realtime measures.

TABLE 2

SECOND WEEK MULTI-TRAIT MULTI-METHOD MATRICES

Other applications of realtime methods provide additional opportunities. Processes other than reactions to advertisements such as political speeches, sales presentations or other interpersonal communications are well suited to measurement via realtime. Realtime measures are also appropriate for examining issues of primacy versus recency effects.

Realtime methods provide rich, process-oriented measures of subject reactions to dynamic messages. They provide us with the opportunity to perform individual-level diagnostics on the process of interest. They are also preferable to post-exposure measures in that individuals are unable to provide retrospective responses about early points in a communication without bias due to having seen the rest of the message (Hughes 1991). With realtime measures we have responses to the communication, up to that point, every fifth of a second. This is useful to researchers who are interested in the process by which attitudes toward messages are formed.

We have provided a demonstration of how methods of evaluating reliability and validity can use measures taken at a point in time to provide evidence of reliability and validity of realtime measures over time. The Multitrait-Multimethod matrix provides evidence that research using realtime measures can have construct validity. We must now explore other methods of validation, ideally with measures that allow us to use all the data in the realtime curve.

REFERENCES

Bagozzi, Richard P. and Youjae Yi (1991), "Multitrait-Multimethod Matrices in Consumer Research," Journal of Consumer Research, 17 (March), 426-439.

Burke, Marian Chapman and Julie A. Edell (1989), "The Impact of Feelings on Ad-Based Affect and Cognition," Journal of Marketing Research, 26 (February), 69-83.

Campbell, Donald T. and Donald W. Fiske (1959), "Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix," Psychological Bulletin, 56 (March), 100-122.

Churchill, Gilbert A., Jr. (1979), "A Paradigm for Developing Better Measures of Marketing Constructs," Journal of Marketing Research, 16 (February), 64-73.

Cook, Thomas D. and Donald T. Campbell (1979), Quasi-Experimentation: Design & Analysis Issues for Field Settings, Boston, MA: Houghton Mifflin Company.

Heath, Timothy B. (1990), "The Logic of Mere Exposure: A Reinterpretation of Anand, Holbrook, and Stephens (1988)," Journal of Consumer Research, 17 (September), 237-241.

Hughes, G. David (1990), "Studies in Imagery, Style of Processing, and Parallel Processing Need Realtime Response Measures," in Advances in Consumer Research, Vol. 17, eds. Marvin E. Goldberg, Gerald Gorn, and Richard W. Pollay, Provo, UT: Association for Consumer Research, 461-466.

Hughes, G. David (1991), "Diagnosing Communications Problems with Continuous Measures of Subjects' Responses: Applications, Potential Applications, Limitations, and Future Research," in Current Issues and Research in Advertising, Vol. 13, eds. James G. Leigh and Claude R. Martin, Jr., Ann Arbor, MI: Division of Research, Graduate School of Business Administration, University of Michigan, 175-196.

Hughes, G. David and Richard Lennox (1990), "Realtime Response Research: Construct Validation and Reliability Assessment," in Enhancing Knowledge Development in Marketing, eds. William Bearden, et al., Chicago, IL: American Marketing Association, 284-288.

MacKenzie, Scott B., Richard J. Lutz, and George E. Belch (1986), "The Role of Attitude Toward the Ad as a Mediator of Advertising Effectiveness: A Test of Competing Explanations," Journal of Marketing Research, 23 (May), 130-143.

MacKenzie, Scott B. and Richard J. Lutz (1989), "An Empirical Examination of the Structural Antecedents of Attitude Toward the Ad in an Advertising Pretesting Context," Journal of Marketing, 53 (April), 48-65.

Olney, Thomas J., Morris B. Holbrook and Rajeev Batra (1991), "Consumer Responses to Advertising: The Effects of Ad Content, Emotions, and Attitude toward the Ad on Viewing Time," Journal of Consumer Research, 17 (March), 440-453.

Peter, J. Paul (1981), "Construct Validity: A Review of Basic Issues and Marketing Practices," Journal of Marketing Research, 18 (May), 133-145.

Petty, Richard E. and John T. Cacioppo (1981), Attitudes and Persuasion: Classic and Contemporary Approaches, Dubuque, Iowa: Wm. C. Brown Company.

Polsfus, Mark and Michael Hess (1989), "The Relationship Between Second-to-Second Response and Direct Response: What is the Link?," Sixth Annual ARF Copy Research Workshop, May 22-23, 1989.

Porado, Philip (1989), "Finding Faster Feedback," Campaigns & Elections, 10 (December), 34-37.
