Process Tracing of Physiological Responses to Dynamic Commercial Stimuli

Piet Vanden Abeele, Catholic University of Leuven
Douglas L. MacLachlan, University of Washington
ABSTRACT - Real-time measures of responses to dynamic stimuli such as TV ads are enjoying a resurgence of interest among practitioners and researchers. This paper assesses the reliability and validity of one such measure, galvanic skin response (GSR), in a particular context. Using a different procedure for analyzing the data (i.e., taking sample-average responses within 3-second intervals as the unit of analysis), we determined that GSR is a relatively reliable measure when sample sizes are adequate, but that its convergent and discriminant validity in measuring attention is suspect. We conclude that it might be the pattern of response, rather than the level, that should be investigated.
To cite: Piet Vanden Abeele and Douglas L. MacLachlan (1994), "Process Tracing of Physiological Responses to Dynamic Commercial Stimuli," in NA - Advances in Consumer Research Volume 21, eds. Chris T. Allen and Deborah Roedder John, Provo, UT: Association for Consumer Research, 226-232.





Physiological responses continue to fascinate consumer researchers, either as stand-alone or as complementary measures of reactions to consumer stimuli (e.g., Kroeber-Riel 1979; Stewart and Furse 1982; Rothschild, Yong, Reeves, Thorson and Goldstein 1988; Bagozzi 1991). Yet these measures have not so far yielded significant or easily applicable benefits to the consumer research community. The reasons for their limited success can be sought in (1) measurement deficiencies (e.g., lack of reliability of such measures as applied in typical marketing research settings), (2) deficiencies in the design of data collection and analysis, and (3) insufficient conceptualization and development of theory.

Although physiological responses are naturally recorded as continuous traces over time, other process tracing approaches involving dial-turning or pencil and paper tasks to measure cognitive and affective responses to dynamic stimuli have also been popular over the years. There is currently a revival of interest in process tracing measures and methods (e.g., Aaker, Stayman, and Hagerty 1986; Boyd and Hughes 1992; Vanden Abeele and MacLachlan 1994). This is a valuable development, because many marketing stimuli are dynamic and, as researchers, we need to understand how people react to such stimuli in real time. There are a number of interesting consumer research questions that might be investigated with continuous process tracing measures. For example, what is the impact of various kinds of dynamically presented stimuli on memory of an ad, on product evaluation, and so forth? Industry practice is ahead of academe in the use of such methods, going back more than fifty years (e.g., Peterman 1940). As academics, one of our roles is to provide basic research on the validity of methods and measures used in practical settings.

Process tracing research raises a number of issues, including (1) what traits to measure, (2) how to measure them, and (3) how to analyze such data. Although many of the process tracing measures require cognitive activity of the respondents in the recording of their reactions (e.g., Aaker et al.'s Warmth Monitor), physiological responses involve no cognitive mediation. The latter responses have appeal because of their autonomous, spontaneous, and uncontaminated character.

Our study focuses on a physiological measure that is fairly straightforward to administer and rather popular: electrodermal response, also called galvanic skin response (GSR). This measure is known to index physiological arousal and hence to be a potential indicator of the orienting reflex (Ohman 1979; Siddle and Spinks 1979). Orientation or attention is an important precondition for successful processing of externally presented information such as advertising. The attention-getting power of an ad (or of specific elements in that ad) is one of the most basic responses of interest to marketers, especially when the target audience is in a low-involvement situation.

Although a number of other physiological measures have been employed in psychology and consumer research, such as electroencephalographic (EEG) measures (see the debate in Psychology and Marketing reviewed by Cacioppo and Petty 1985), GSR has been one of the most used (Bagozzi 1991). However, the interest there has typically been non-dynamic, i.e., in the response to a single, non-continuous stimulus such as the fear aroused by the picture of an object. In our research we analyze GSR as a continuous response (a "process trace") rather than as a single reaction to a stimulus. The analysis strategy pursued here is to exploit the variances and covariances of measures as computed across commercial stimuli, where the stimulus objects are the contents of temporal segments of commercials. Although most previous research has looked at individual traces, we use the average (over sampled individuals) GSR value within segments of exposure time as the basic unit of analysis. By doing so we can, for example, compare response levels within ads (e.g., where is the peak located?), response levels across ads (e.g., which commercial produces the highest peak?), or response patterns across ads (e.g., does this ad's trace replicate that of another ad?).

Thus, the research reported here is a modest contribution to the literature on process tracing methods. We are examining one trait (attention or orienting response), with one method (GSR), and introduce one particular method of analysis (using the temporal segment as the basic unit of analysis).


The propositions under investigation in our study are the following:

P1: GSR produces reliable sample-average traces.

We propose that GSR reliability can be demonstrated for sample sizes that are of reasonable size for both practical application and consumer research.

P2: GSR is a valid process trace measure of attention or orienting response.

There are several subpropositions underlying this proposition: (1) We propose that traces will be sufficiently valid to distinguish real ad traces from neutral non-ad traces. (2) We propose that GSR correlations will not be affected by the nature of the task set (i.e., whether or not other measures are being taken concurrently). (3) We propose that GSR traces produced in response to commercials will be correlated with constructs theoretically linked to attention (e.g., attention-getting, to be described later). (4) We propose that there will be appropriate correlation with exogenously assessed properties of temporal segments of ads (to be defined later). Finally, (5) we propose that there should be a different response pattern to commercials of different character (e.g., ads selected to be "warm" versus "non-warm" or "activating" versus "non-activating").


Subjects were final-year Belgian business administration undergraduates. The subjects were individually exposed to a videotape of TV commercials and other stimuli in the Consumer Research Lab at Catholic University of Leuven, Belgium. This lab was arranged as a living room in order to put subjects at ease during ad exposure and response measurement.

The treatment to which the respondents were exposed consisted of a videotape containing 12 real TV commercials and multiple insertions of four filler "bogus commercials". The twelve commercials were preselected to represent the extremes on two dimensions, Warmth and Activation, using the same procedure as described by Aaker et al. (1986) for their dimensions of warmth, information, humor and irritation.

A pretest was performed on a comparable student sample (n=50) who were shown the commercials and asked to give overall ratings on two scales. One of the scales was for Warmth. The other scale was for Activation/Excitement (i.e., for stimulus-induced orienting response). Respondents evaluated 50 commercials, of which 12 were retained, namely three each for the following conditions: (1) high warmth, high activation (denoted WA), (2) high warmth, low activation (Wa), (3) low warmth, high activation (wA) and (4) low warmth, low activation (wa). These twelve ads thus constitute a form of orthogonal design within the study. The scores on warmth and on activation differed highly significantly between the chosen ads at each extreme of the scales; the average warmth and activation ratings of the commercials were nearly (but not perfectly) uncorrelated.

The four "bogus commercials" consisted of a "bouncing ball" animation. A bogus ad was shown after every real ad. This has the advantage not only of allowing us to record the GSR response to "nonsense" material, but also of letting the GSR trace stabilize between commercials. Four videotapes were put together, containing the same real commercials in systematically varied rotations.

Skin response was measured using the ZAK Biosystems EDA/S Module, which recorded electrodermal conductance at half-second intervals through electrodes affixed to the palm of the non-dominant hand. The GSR data-handling software is from INTERTEST (Netherlands) and ROGIL (Belgium).

The response to commercials was recorded under one of three conditions. Subjects in the first condition were administered simply the GSR measurement. They were told that we were interested in spontaneous responses to commercials as measured through physiological indexes. Subjects were given time to adjust to the task environment. When their GSR trace had stabilized, they were shown the tape containing the commercials. The sample size in this group was 21 respondents. The measurements obtained from this sample will be denoted as GSR(0), since no other concurrent measure was administered with the GSR.

Subjects in the second condition were given only the task described by Aaker et al., using the instructions for their Warmth Monitor (translated into Dutch). The sample size of this group was 14 respondents. The measures obtained from this group will be denoted as WM(0), since no GSR was administered concurrently.

In the third condition, subjects were given both tasks to perform simultaneously. The subject's dominant hand was assigned the Warmth Monitor task, whereas the non-dominant hand was used for GSR measurement. The sample size for this condition was 30 subjects. Two measurements were thus provided by subjects in this condition, which will be denoted as WM(+) and as GSR(+), since in both cases the two measurements were carried out simultaneously.


The unit of observation in the study is a three-second segment of a commercial. Each individual's GSR and WM score is the maximum value observed within the corresponding three-second segment. In the case of GSR, the segments are exactly three seconds long, as the time axis is strictly controlled in the measurement. In the case of WM, the drawn trace was divided into as many equal-sized segments as there were three-second intervals in the commercial (i.e., a 30-second commercial would have 10 segments). Note that there may not be a perfect correspondence between real time and distance along the trace for the WM measure.

Temporal segments of all commercials were coded as zero-one dummy variables for a number of format, execution, and content characteristics. These were used as independent variables in one of the analyses to follow.

All individual responses were ipsatized within the respondent; i.e., they were standardized relative to the mean and standard deviation of the individual's GSR and WM responses, respectively. This removes individual differences in the sensitivity of respondents to the two measures (Ben-Shakhar 1985).

In what follows, comparisons are made and correlations computed based on average ipsatized responses (averaged over subjects) to 3-second commercial segments. The twelve real commercials together totaled 110 segments or 330 seconds of programming. The bogus commercials each lasted 10 segments or 30 seconds. For some analyses, we pooled the measures for GSR(0) and GSR(+) into a single series with 220 observations.
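The data preparation described above (segment maxima, ipsatization within respondent, averaging over subjects) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, six half-second readings are assumed to make up one three-second segment, and ipsatization is assumed to be applied to the segment maxima rather than to the raw readings, which the text leaves open.

```python
import numpy as np

def segment_maxima(raw, samples_per_segment=6):
    """Per-subject maximum within each segment.

    raw: (n_subjects, n_samples) array of conductance readings.
    Six half-second readings per three-second segment is an assumption.
    """
    raw = np.asarray(raw, dtype=float)
    n_subjects, n_samples = raw.shape
    n_segments = n_samples // samples_per_segment
    # Drop a trailing partial segment, if any, before reshaping.
    trimmed = raw[:, : n_segments * samples_per_segment]
    return trimmed.reshape(n_subjects, n_segments, samples_per_segment).max(axis=2)

def ipsatize(scores):
    """Standardize each subject's scores by that subject's own mean and SD."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=1, keepdims=True)
    sd = scores.std(axis=1, ddof=1, keepdims=True)
    return (scores - mean) / sd  # a subject with constant scores would need special handling

def segment_average_trace(raw, samples_per_segment=6):
    """Return (ipsatized scores per subject, sample-average trace per segment)."""
    z = ipsatize(segment_maxima(raw, samples_per_segment))
    return z, z.mean(axis=0)
```

Under these assumptions, 330 seconds of programming sampled every half second give 660 readings per subject and hence 110 segment scores, matching the number of segments reported above.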


Reliability of GSR

Since we use the sample mean as a unit of analysis, reliability will depend on (1) the size of the sample of respondents, (2) the heterogeneity of respondents' reactions, and (3) the domain of variance considered (e.g., all ads or one ad). We base our assessment of reliability on split-sample (i.e., split-half) correlations. Since the error variance (i.e., the sampling variance of the mean GSR score) is known, the typical reliability formula could be used, except that the error variance is heteroscedastic between commercial segments. The "average" reliability can only be computed if an "average" error variance is entered into the reliability formula. This difficulty is avoided if we (randomly) split the sample in half and compute two parallel series of the mean GSR score for each segment, one for each subsample. Next, we correlate these series, compute the split-half reliability corresponding to the actual sample size used in the study, and estimate the sample size required for a given level of reliability, e.g., .90 (Nunnally 1978). This gives the results shown in Table 1.
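The split-half procedure can be illustrated as follows. The sketch assumes that the step from the half-sample correlation to the full-sample reliability, and from there to a required sample size, uses the Spearman-Brown formula; the text does not name the formula, so this is our reading rather than a statement of the exact computation behind Table 1. All function names are hypothetical.

```python
import numpy as np

def split_half_correlation(z, rng):
    """Correlate the mean traces of two random halves of the sample.

    z: (n_subjects, n_segments) array of ipsatized scores.
    """
    z = np.asarray(z, dtype=float)
    order = rng.permutation(z.shape[0])
    half_a, half_b = order[: len(order) // 2], order[len(order) // 2 :]
    return float(np.corrcoef(z[half_a].mean(axis=0), z[half_b].mean(axis=0))[0, 1])

def spearman_brown(r, k):
    """Reliability of a mean based on k times as many respondents."""
    return k * r / (1 + (k - 1) * r)

def lengthening_factor(r, target):
    """Factor by which the sample must grow to reach the target reliability."""
    return target * (1 - r) / (r * (1 - target))
```

The full-sample reliability is then `spearman_brown(r_half, 2)`. As a purely arithmetic illustration, a reliability of .60 would require about six times as many respondents to reach .90, since .90 x .40 / (.60 x .10) = 6.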

Our conclusion regarding the reliability of GSR is that it is high enough for academic research at the sample sizes used here. However, the reliability of the measure seems too low to allow confident marketing decisions (e.g., to compare the strength of response at one moment in the ad to that at another). The reliabilities for GSR are still sufficient to carry out research, but they obviously limit the extent to which other variables can be found to be significantly correlated with GSR.

Validity of GSR

A first way to consider the issue of validity is to compare the responses to real ads with those to bogus ads. The data consist of sample-average 3-second segment GSR scores for the bogus ads and for three real ads that each run for 10 segments, the same length as the bogus ads. The average GSR traces for both types are portrayed graphically and analyzed by means of a two-way ANOVA. The factors are (1) the type of ad (real versus bogus) and (2) the sequence position of the segment in the commercial (1 through 10). If the two GSR traces differ in shape, this should show up as a significant interaction between ad type and segment sequence, since a significant interaction implies that the trace differs according to type of ad. Figure 1 shows there is a clear difference in the pattern of GSR according to the type of ad. Each of these traces is of interest in itself, i.e., it provides a "norm" for the trace to an "average ad" (among the ads we selected) and for the trace to "non-ads". The real ads show a distinctive trace, whereas bogus ads show a typical habituation pattern. These visual conclusions are confirmed by the ANOVA results in Table 2.





From the ANOVA, the sequence effect is marginally significant and the interaction is highly significant.
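The interaction test at the heart of this comparison can be sketched as a comparison of two nested dummy-variable models: one with additive effects of ad type and segment position, and one that adds their interaction. This is a minimal illustration with hypothetical names, not a reconstruction of the computations behind Table 2; the p-value step (looking up the F distribution) is left out.

```python
import numpy as np

def _ssr_and_rank(y, design):
    """Residual sum of squares and rank of an OLS fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid), int(np.linalg.matrix_rank(design))

def interaction_f(y, ad_type, position):
    """F statistic for the ad type x segment position interaction.

    y: segment-level mean GSR scores; ad_type, position: factor labels.
    Returns (F, numerator df, denominator df).
    """
    y = np.asarray(y, dtype=float)
    ad_type, position = np.asarray(ad_type), np.asarray(position)
    n = len(y)
    # Dummy coding with the first level of each factor as the reference.
    type_d = (ad_type[:, None] == np.unique(ad_type)[None, 1:]).astype(float)
    pos_d = (position[:, None] == np.unique(position)[None, 1:]).astype(float)
    additive = np.column_stack([np.ones(n), type_d, pos_d])
    products = [type_d[:, [i]] * pos_d for i in range(type_d.shape[1])]
    full = np.column_stack([additive] + products)
    ssr_add, rank_add = _ssr_and_rank(y, additive)
    ssr_full, rank_full = _ssr_and_rank(y, full)
    df_num, df_den = rank_full - rank_add, n - rank_full
    f = ((ssr_add - ssr_full) / df_num) / (ssr_full / df_den)
    return f, df_num, df_den
```

A large F here says that the shape of the trace over the ten positions depends on ad type, which is the sense in which the interaction, rather than a main effect, carries the validity argument.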

In Table 3, which is a type of multi-trait, multi-method matrix, we see that GSR(+) and WM(+) are weakly, but significantly correlated at .213. This suggests that there exists a reactivity effect, since other cross-trait measures are not significantly correlated. Thus GSR measures are affected by task set and, to that extent, may not be valid measures when other constructs are being measured concurrently.

Aaker et al. (1986) used evidence that their Warmth Monitor measure (what we have designated WM(0) or WM(+)) correlated with GSR to support their contention that the former is a valid measure, since they hypothesized a domain overlap between "warmth" and whatever is measured by GSR. Our Table 3 also examines GSR's concurrent validity with the warmth measure. However, the correlations between different measures are not significant, even after correcting for attenuation due to lack of reliability (except for the same-sample condition, which could be due to reactivity as noted above). We conclude that there is no strong evidence for concurrent validity of GSR with respect to warmth. On the other hand, this result may provide evidence for discriminant validity of the GSR measure.
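The correction for attenuation mentioned here is, in its classical form, the observed correlation divided by the square root of the product of the two reliabilities. The sketch below assumes that this classical form is what was applied; the function name and the example values are hypothetical.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation.

    r_xy: observed correlation between two measures.
    rel_x, rel_y: reliabilities of the two measures, each in (0, 1].
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)
```

For instance, an observed correlation of .30 between two measures that each have a reliability of .60 would correspond to a corrected correlation of .50. The correction can only scale an observed correlation upward; it cannot turn a near-zero correlation into a substantial one.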

We examined the convergent validity of the GSR trace by correlating segment averages with another measure of attention-getting power, namely a paper-and-pencil measure of the Warmth Monitor type, adapted to monitor attention-getting rather than warmth. The latter was administered to a separate sample of 20 students. The correlation between GSR(0) and this attention-getting measure was 0.23 (p<.05). Although significant, the correlation is not high and thus provides little evidence of convergent validity for GSR as a measure of attention-getting.





Since we had coded properties of commercial segments, we were able to see if GSR measures distinguished between attention-getting and non-attention-getting characteristics of the commercials. Construct validity of GSR would be supported if GSR were positively correlated with attention-getting characteristics, but were uncorrelated or negatively correlated with non-attention-getting properties.

Deciding whether the results for the segment characteristics offer supporting evidence for our proposition is difficult in the absence of a controlled experimental design. As an approximation to a hypothesis test, we consider the univariate correlations between judged attention-getting power of the segment characteristics and the sample-mean GSR response. An independent panel of judges, consisting of seven management school research assistants, was asked to consider each characteristic as well as the definition of the attention-getting response. They were then asked to predict for each characteristic whether they expected a significantly positive, neutral, or negative correlation with the attention-getting trace (i.e., what we have been calling the attention-getting counterpart to the WM measure); their responses were coded +1, 0, or -1, respectively. Each characteristic was then coded in terms of the sum of the judges' scores, which then could range from +7 to -7. This obviously crude assessment of the attention-getting nature of the segment properties of the ads is probably most valid near the end-points of the scale, where most agreement occurs among the judges.
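The judge-based coding and the univariate correlations can be sketched as follows. The function names are hypothetical, and the correlation shown is an ordinary Pearson correlation between a zero-one segment dummy and the sample-mean GSR series, which is one natural reading of "univariate correlations" here.

```python
import numpy as np

def judged_attention_score(votes):
    """Sum of the judges' predictions, each coded +1, 0, or -1."""
    votes = np.asarray(votes)
    if not np.isin(votes, (-1, 0, 1)).all():
        raise ValueError("each vote must be -1, 0, or +1")
    return int(votes.sum())

def dummy_correlation(dummy, gsr_trace):
    """Pearson correlation between a 0/1 segment characteristic and mean GSR."""
    dummy = np.asarray(dummy, dtype=float)
    gsr_trace = np.asarray(gsr_trace, dtype=float)
    return float(np.corrcoef(dummy, gsr_trace)[0, 1])
```

With seven judges the score is bounded by -7 and +7, and a score near either bound requires near-unanimous votes, which is why the end-points of the scale are the most trustworthy.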

The characteristics, their judged attention-getting nature, and correlations with the GSR measure are shown in Table 4.

Because of the high correlation between the GSR(+) and GSR(0) series, we pooled both series to achieve a sample size of 220 observations instead of only 110 for each separate series. The table is divided according to whether or not the judges, on balance, expected the characteristic to correlate with attention-getting. Only a couple of the ad characteristics judged to be positively correlated with attention-getting were actually found to be positively correlated with GSR at significant levels. Moreover, several characteristics judged not to be attention-getting were found to be significantly correlated with GSR, only one of them negatively. Thus this study provides no real evidence for the construct validity of GSR as a measure of attention.

Since several of the properties are not independent in the ads, we studied the proposition further by running separate dummy-variable regressions with GSR as the dependent variable and with explanatory variables in three groups: ad condition (i.e., warmth/activation category), segment sequence (1st, 2nd, middle, next-to-last, last), and segment properties (specific coded characteristics of the ad execution). Finally, we added the WM measure as a possible covariate. Table 5 contains the adjusted R-squared values achieved by adding each set of explanatory variables in turn.
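The block-wise regressions can be sketched as follows: each set of explanatory variables is appended to the design matrix in turn, and the adjusted R-squared is recorded after each step. The function names and the block labels in the docstring are hypothetical, and ordinary least squares with an intercept is assumed.

```python
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R-squared of an OLS fit of y on X plus an intercept."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    # Use the rank so that redundant dummy columns are not counted twice.
    p = np.linalg.matrix_rank(design) - 1
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def incremental_adjusted_r2(y, blocks):
    """Adjusted R-squared after adding each block of regressors in turn.

    blocks: list of (label, (n, p_k) array), e.g. ad condition dummies,
    segment position dummies, segment property dummies, WM covariate.
    """
    X = np.empty((len(y), 0))
    results = []
    for label, block in blocks:
        X = np.column_stack([X, block])
        results.append((label, adjusted_r2(y, X)))
    return results
```

Because adjusted R-squared penalizes the number of regressors, a large block of dummies that adds little explanatory power can leave the figure nearly unchanged or even lower it, which is worth keeping in mind when reading the increments.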

From Table 5 we see that there is a significant effect of segment position. The maximum variation in GSR that could possibly be explained is the reliability of GSR (either .60 or .68 from Table 1). Segment position explained 7% of the variation in GSR, whereas the numerous segment properties contributed only 12% more to explanation of variation. There was no obvious pattern to the significant coefficients of the segment content, format and execution properties, thus again giving no evidence of construct validity of GSR as a measure of attention. Finally, there was almost no variance explained by the addition of a WM variable, indicating no marginal overlap between the "warmth" construct and whatever is being measured by GSR.

As a final validation method, we examined how GSR differs under different conditions of ad activation and warmth. Figure 2 plots GSR for the four conditions implied by high or low activation of the real ads and high or low warmth at 5 different points in the ad sequence.





Note that the horizontal axis variable is not continuous real time, but sequence position, from first segment to last. An ANOVA of the same data, Table 6, shows that there is a main (level) effect of activation, a segment sequence effect (as expected), different trace patterns across time for low vs. high warmth commercials, and a three-way interaction. Thus we see there is evidence for a different GSR response depending on type of ad, but only for activation, not warmth.

A number of other analyses of the GSR trace were conducted, including correlations with warmth and format/execution properties within particular ads and with lagged values of dependent and independent variables. However, all results of such analyses were uninformative or inconclusive.


A primary suggestion of this study is that the appropriate domain of variation for assessment of reliability and validity of GSR (and other process tracing measures) is across different small segments of ads. For this purpose, we suggest that average ipsatized response of subjects within each such segment be the appropriate unit of analysis.





We conclude on the basis of this study that GSR is reliable enough for academic research. However, the GSR score, as we have measured it, does not exhibit evidence of being a valid measure of attention-getting response to dynamic characteristics of TV commercials. At least, the variation in it cannot be well explained by the variables we have considered in this study. Either we have not measured it well (although we have shown there is a core of systematic variance), or we do not have the right predictors (or have measured the predictors poorly).

An appealing alternative idea is that we should not be considering the level of the GSR trace as the dependent variable of interest, but rather the pattern of that trace. We did find significance of the segment sequence variable in the regressions and found significance of sequence and interaction with sequence in the ANOVAs of Tables 2 and 6.

Future research should be conducted comparing GSR with other attention-getting and similar responses to assess further the convergent and discriminant validity of the measure. But additionally, research should focus on the shape or pattern of the response over commercials.

References


Aaker, David A., Douglas M. Stayman and Michael R. Hagerty (1986), "Warmth in Advertising: Measurement, Impact, and Sequence Effects," Journal of Consumer Research, 12 (March), 365-381.

Bagozzi, Richard P. (1991), "The Role of Psychophysiology in Consumer Research," in Handbook of Consumer Behavior, ed. Thomas S. Robertson and Harold H. Kassarjian, Englewood Cliffs, NJ: Prentice Hall, 124-161.

Ben-Shakhar, Gershon (1985), "Standardization Within Individuals: A Simple Method to Neutralize Individual Differences in Skin Conductance," Psychophysiology, 22 (3), 292-299.

Boyd, Thomas C. and G. David Hughes (1992), "Validating Realtime Response Measures," in Advances in Consumer Research, Vol. 19, ed. John F. Sherry, Jr. and Brian Sternthal, Provo, UT: Association for Consumer Research, 649-656.

Cacioppo, John T. and Richard E. Petty (1985), "Physiological Responses and Advertising Effects: Is the Cup Half Full or Half Empty?" Psychology & Marketing, 2 (Summer), 115-126.

Kroeber-Riel, Werner (1979), "Activation Research: Psychobiological Approaches in Consumer Research," Journal of Consumer Research, 5 (March), 240-250.

Nunnally, Jum C. (1978), Psychometric Theory. New York: McGraw-Hill, 2nd edition.

Ohman, A. (1979), "The Orienting Response, Attention and Learning: An Information Processing Perspective," in The Orienting Reflex in Humans, ed. H. D. Kimmel, E. H. van Olst and J. F. Orlebeke, Hillsdale, NJ: Lawrence Erlbaum, 443-471.

Peterman, J. N. (1940), "The Program Analyzer: A New Technique in Studying Liked and Disliked Items in Radio Programs," Journal of Applied Psychology, 24 (December), 728-741.

Rothschild, Michael L., J. Hyun Yong, Byron Reeves, Esther Thorson and Robert Goldstein (1988), "Hemispherically Lateralized EEG as a Response to Television Commercials," Journal of Consumer Research, 15 (September), 185-198.

Siddle, David A. T. and John A. Spinks (1979), "Orienting Response and Information Processing: Some Theoretical and Empirical Problems," in The Orienting Reflex in Humans, ed. H. D. Kimmel, E. H. van Olst and J. F. Orlebeke, Hillsdale, NJ: Erlbaum, 557-564.

Stewart, David W. and David H. Furse (1982), "Applying Psychophysiological Measures to Marketing and Advertising Research Problems," in Current Issues and Research in Advertising, ed. Janet H. Leigh and Claude R. Martin, Jr., 1-38.

Vanden Abeele, Piet and Douglas L. MacLachlan (1994), "Process Tracing of Emotional Responses to TV Ads: Revisiting the Warmth Monitor," Journal of Consumer Research, 20 (March).