Statistical Power and Effect Size in Consumer Research

Alan G. Sawyer, The Ohio State University
ABSTRACT - Statistical power and effect size are not considered sufficiently by consumer researchers. Better attention to these two factors can improve the planning, execution, and reporting of consumer research. Suggestions are offered about how to increase effect size and improve statistical power.
Citation: Alan G. Sawyer (1982), "Statistical Power and Effect Size in Consumer Research," in NA - Advances in Consumer Research Volume 09, eds. Andrew Mitchell, Ann Arbor, MI: Association for Consumer Research, Pages: 1-7.

Advances in Consumer Research Volume 9, 1982      Pages 1-7


INTRODUCTION

Researchers should be concerned with at least four types of research validity--internal, external, construct, and statistical conclusion validity (Cook and Campbell, 1979). Consumer researchers have long been properly concerned with internal validity (e.g., Banks, 1965) and external validity (e.g., Permut, Michel, and Joseph, 1976; Sawyer, Worthing, and Sendak, 1979). There has also been a recent increase in attention to construct validity (Churchill, 1978; Cohen, 1979; Heeler and Ray, 1972; Sawyer, 1975; Shocker and Zaltman, 1977). Statistical conclusion validity has not received much recent attention in the marketing literature. This void is probably due to the apparently elementary nature of statistical conclusion validity and the belief that most published marketing researchers are sufficiently competent in this area. Most written comment about statistical conclusion validity has involved the proper use and interpretation of advanced techniques or remarks on specific research reports. This article addresses a very basic aspect of statistical conclusion validity--statistical power.

Statistical power is the probability that a null hypothesis will be correctly rejected by a statistical test. In other words, statistical power is a measure of a research study's statistical sensitivity to detect an effect or relationship that actually exists. The implications of this basic concept for marketing and consumer research are very important. This article reviews the factors that determine statistical power and the important implications of both low and high power. Many examples of published consumer research are cited to illustrate the potential difficulties of failure to explicitly consider and report statistical power and the related concept of effect size. Finally, suggestions are offered about how to improve statistical power.

Four major issues involving statistical power and effect size estimates are relevant to consumer research. First, a research design ought to have adequate power to detect an anticipated effect size. Second, the decision about whether to replicate a study that gives results directionally consistent with hypotheses but statistically insignificant should be based on the level of power and effect size in that study and whether either can be increased feasibly in the replication. Third, any credible decision to "accept the null hypothesis" of no effect must be accompanied by a highly powered research design that reveals only a very small effect size. Finally, because large sample sizes usually allow even very small effects to be statistically significant, it is especially important with highly powered research designs to measure and report effect sizes in addition to statistical significance. [For more details about the determinants of statistical power and some definitions of effect size estimates see Cohen (1977) and Sawyer and Ball (1981).]

LOW STATISTICAL POWER

The statistical power of a contemplated research design should be estimated before the data are gathered. If a design has unacceptably low power to detect the effect of interest, the design ought to be changed to improve the power. If limited resources preclude a satisfactory level of power and if statistical significance at a low Type I error rate is desired, the research is probably not worth the time, cost, and effort and should be abandoned. At the very least, a researcher who decides to conduct a study with low statistical power should be fully aware of the low power and anticipate that rejection of the null hypothesis at conventional Type I error rates is unlikely even if the null hypothesis is false. If the null hypothesis is not rejected by the selected statistical test in a research design with low power, it is very difficult to assess whether there is, in fact, no (or only a negligibly small) relationship in the population or the research was not sensitive enough to detect a relationship that is actually present in the population.
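
Such a priori power estimates are straightforward to compute. The sketch below, a minimal illustration rather than part of the original article, approximates the power of a two-sample t-test for a given Cohen's d using the normal approximation to the noncentral t distribution; exact values from noncentral t tables such as Cohen's (1977) differ slightly for small samples.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def z_critical(alpha):
    """Upper-tail standard normal critical value, found by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if 1.0 - normal_cdf(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def t_test_power(d, n_per_cell, alpha=0.05, one_tailed=True):
    """Approximate power of a two-sample t-test to detect Cohen's d,
    using the normal approximation to the noncentral t distribution."""
    z_crit = z_critical(alpha if one_tailed else alpha / 2.0)
    delta = d * sqrt(n_per_cell / 2.0)  # noncentrality parameter
    return normal_cdf(delta - z_crit)
```

For example, with only 10 subjects per cell a medium effect (d = .50) yields power of only about .30 at a one-tailed 5% Type I error rate under this approximation.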

It seems fairer to criticize low power in basic theoretical research than in applied research for several reasons. As theory tests typically build on one another, estimates of effect size from prior research are more likely to be available. Also, much theory research is done in the lab where more flexible and more powerful research design is possible. The reduced emphasis on external validity often allows the use of less expensive subject populations. Finally, theoretical research is more likely than applied research to involve a need to test for the presence of even relatively small effects.

An example of underpowered theoretical consumer research is the study of Sternthal, Dholakia, and Leavitt (1978) which was designed to test cognitive response theory predictions about the persuasive effects of source credibility and initial opinion. Hypotheses about the effects on attitudes were mostly supported; however, statistical significance was not found for important predictions about cognitive and behavioral responses. The insignificant cognitive response results were especially disappointing because the predicted effects were crucial to the theoretical explanation of the attitudinal effects. Sternthal et al. suggested that the insignificant results may have been due to a low level of generated counterarguments. It also seems reasonable to place some blame on the very low statistical power. Only 37 subjects were used--17 who had a positive prior opinion and 20 who were initially negative. Each of these subjects was assigned to either a moderate or high source credibility condition. A critical theoretical test was the difference in the number of counterarguments generated by initially negative subjects exposed to high and moderate credibility sources. In fact, as predicted, the moderate source subjects generated more counterarguments (.5) than those in the high credibility source condition. Compared with the estimated common population standard deviation of 1.49, this difference in counterarguments amounted to an effect size of d = .34. However, the power of a t-test to detect an effect size of .34 with 10 subjects in each cell is only 19% with a one-tailed Type I error rate of 5%. Even for effect sizes defined as medium (d = .50) or large (d = .80) by Cohen (1977), power would have been only 29% and 53%, respectively.

Although predictions about behavior were less important to the cognitive response theory, Sternthal et al.'s results showed pronounced differences in predicted directions. For positive subjects the moderate source (56%) resulted in greater behavioral compliance, as predicted, than the high credibility source (25%); for negative subjects the predicted difference in the opposite direction was found (10% vs. 30%). Sternthal et al. probably did not expect differences larger than these; indeed, compared with the commonly low effects of credibility in past research (Sternthal, Phillips, and Dholakia, 1978), the observed differences (w = .30 for positive subjects and w = .23 for negative subjects) were remarkably high. With the low sample sizes, however, power to detect the behavioral compliance effect size that would be estimated from the data for positive subjects was only 23% for N = 17. Certainly a greater sample size could have provided a more adequate test of the impressively developed theory. For example, given an effect size of w = .30, a sample of 87 subjects in each prior opinion condition would result in 80% power at a 5% Type I error rate for the chi square tests on behavioral compliance.
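
The power figures for these 2 x 2 compliance tests can be checked with a short calculation, not part of the original article. With one degree of freedom the noncentral chi-square variate is the square of a shifted normal, so power follows directly from the normal CDF; the noncentrality parameter is N times w squared.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def chi2_power_1df(w, n):
    """Power of a 1-d.f. chi-square test of association (e.g., a 2 x 2
    table) at a 5% Type I error rate.  With 1 d.f., rejecting when
    chi-square exceeds 3.841 is equivalent to |Z + w*sqrt(N)| > 1.960."""
    z_crit = 1.95996          # sqrt(3.841), the 5% critical value
    shift = w * sqrt(n)       # square root of the noncentrality parameter
    return (1.0 - normal_cdf(z_crit - shift)) + normal_cdf(-z_crit - shift)
```

This reproduces the figures in the text: chi2_power_1df(0.30, 87) is approximately .80, and chi2_power_1df(0.30, 17) is approximately .23-.24.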

Other examples of underpowered statistical tests can be found in consumer research. Despite small effect sizes for attitudes and beliefs in prior corrective advertising research, Sawyer and Semenik (1978) randomly assigned only 142 subjects to 24 experimental conditions. The test of an important ad appeal x order of measurement x time of delayed measurement interaction had only a 42% chance of detecting a medium effect and only a 9% chance of detecting a small effect (which was quite likely given past research results). Either the design should have been simplified into fewer cells or greater cell sizes should have been used. Monroe and Guiltinan (1974) studied relationships between lifestyles, planning and budgeting practices, attribute importance, and perceptions concerning alternative retail stores. Although no similar research efforts were available to get confident estimates of effect size, very high correlations among these variables were unlikely; hence, the likelihood of large differences between correlations was even lower. However, differences in cross-lagged correlations were tested in a total sample of 169 as well as subpopulations of 76, 50, and 30. A two-tailed test of the differences between two correlations of a small (medium) size has power of only 15% (79%) for the total sample and only 9% (44%), 8% (31%), and 7% (20%) for the smaller samples at a Type I error rate of 5%.

REPLICATIONS AND POWER

If a study fails to reject a null hypothesis, classical statistics leaves any decision in "suspended judgment" (Hays, 1961, p. 263). Post hoc calculations of estimated power enable a researcher to estimate the likelihood that low power is to blame. If power is very low, the researcher (or someone else) might well decide to replicate the study albeit with greater power. Replicating an experiment enables a researcher to estimate more confidently the likely effect size and to provide power adequate to detect that size effect. However, Tversky and Kahneman (1971) conclude that researchers mistakenly tend to underpower attempted replications; the power of a replication of results that, although statistically insignificant, were in the hypothesized direction ought to be greater than that of the replicated research. Some marketing researchers may not have heeded this advice.

For example, Wheatley and Oshikawa (1970) attempted to replicate two studies which both gave nonsignificant directional evidence that moderate fear advertising appeals were more effective than positive appeals for a low anxiety audience. Despite the prior knowledge of, at best, very small effects in two previous studies, Wheatley and Oshikawa used a 2 (anxiety) x 2 (ad appeal) experimental design with 96 subjects that had only a 15% probability of detecting a small effect (f=.10) with a Type I error rate of 5%. Because of the past results and the very low power to detect the expected small effect, the fact that Wheatley and Oshikawa failed for a third time to find a significant interaction effect of appeal and anxiety is not surprising.

McAlister (1979) properly chose to consider the statistical power of research which failed to reject the null hypothesis. She tested two models of choice of multiple items from a product class. The first model was supported by the empirical results. However, the second model which involved high school students' choices of colleges to which they would apply was not a statistically significant improvement over the null hypothesis of a random choice process. McAlister appropriately noted that, given the observed effect size, the sample of 32 choices per subject provided insufficient power. Such information alerts the reader that a replication might successfully support the hypothesized "lottery" model if it were feasible to increase either sample size or effect size. A larger effect size might be achieved by a wider manipulation of the students' perceptions of the probability that they would be accepted by the colleges in question. If the effect size were doubled to .066, the sample size suggested by Srinivasan's (1977) analysis to reject the null hypothesis at a Type I error rate of about 10% would be 34--only two more choices per respondent than were used by McAlister. Alternatively, about 73 choices would be needed to reject the null hypothesis with the effect size of .033 attained in her study.
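
This kind of replication planning--asking how large a sample a given effect size requires--can be sketched in general form. The example below is illustrative only and uses the two-sample t-test under the normal approximation, not the specific test from Srinivasan's (1977) analysis; it searches for the smallest per-cell sample size that reaches a target power.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(d, n_per_cell, z_crit=1.95996):
    """Two-tailed (5%) normal-approximation power of a two-sample
    t-test to detect Cohen's d."""
    return normal_cdf(d * sqrt(n_per_cell / 2.0) - z_crit)

def n_for_power(d, target=0.80):
    """Smallest per-cell sample size whose power reaches the target."""
    n = 2
    while power(d, n) < target:
        n += 1
    return n
```

For instance, detecting a medium effect (d = .50) with 80% power requires roughly 63-64 subjects per cell, while halving the effect size roughly quadruples the required sample.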

"PROVING THE NULL HYPOTHESIS"

If no statistically significant effect of an independent variable is found, a researcher must decide whether there is, in fact, no or a negligible effect or whether the insignificant result is due to some method deficit. A conclusion in favor of the null hypothesis is usually difficult to defend but should be acceptable in some instances. Such a conclusion requires evidence of a large and valid manipulation and reliable, valid measurement (see Carlsmith, Ellsworth, and Aronson, 1976; Cook and Campbell, 1979; Cook et al., 1978; Greenwald, 1975a). An additional requirement is high statistical power. High statistical power lends credibility to any results favoring the null hypothesis for that particular research design and method and allows attention to be focused properly on alternative plausible hypotheses involving internal or construct validity.

An example of research that concluded in favor of the null hypothesis despite very low power is Hawkins' (1970) empirical evaluation of the effects of subliminal advertising on choice behavior. Two tests, one involving 20 subjects and a follow-up replication with 10 subjects, found no statistically significant effect. In the latter test, six of the 10 subjects chose the subliminally advertised product. Although such a difference was not statistically significant, the low power of the sign test with only 10 subjects (10% for a small effect, g = .05, and only 25% for even a medium effect, g = .15) hardly provided sufficient evidence against the alternative hypothesis.
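
The power of a sign test this small can be computed exactly from the binomial distribution. In Cohen's (1977) notation g is the departure of the population proportion from .50, so g = .05 corresponds to p = .55. Note, as an assumption of this sketch, that with n = 10 the achievable one-tailed Type I error rate nearest 5% is .0547 (critical count of 8), which is why the default alpha below is set slightly above .05.

```python
from math import comb

def sign_test_power(p_alt, n, alpha=0.055):
    """Exact power of a one-tailed sign test.  The critical count c is
    the smallest value whose null (p = .5) tail probability does not
    exceed alpha; power is the tail probability of c under p_alt."""
    def upper_tail(p, c):
        return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
                   for k in range(c, n + 1))
    c = next(c for c in range(n + 1) if upper_tail(0.5, c) <= alpha)
    return upper_tail(p_alt, c)
```

With n = 10 this gives power of about .10 for g = .05 (p = .55) and about .26 for g = .15 (p = .65), in close agreement with the 10% and 25% figures cited from Cohen's tables.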

Since one can never truly "prove" a null hypothesis, a researcher who is unable to reject a null hypothesis is especially responsible to note the size of the obtained effect. For example, Bush, Hair, and Solomon (1979) concluded that, with one minor exception, there was no relationship between prejudice and reactions to print ads with models of different races. Only a few odd interactions were statistically significant. Consistent with the recommendation to also quantify effect size, Bush et al. calculated ω2 (all lower than 2%) for the few significant effects and allowed the reader to judge the importance of the obtained effect size. The authors might have further enlightened the reader by presenting the power of their research (which was quite reasonable with 237 subjects) to detect a given effect size.

An example of how attention to effect size can help to identify cases in which statistically insignificant results ought not be ignored comes from the consumer information processing research of Jacoby and his colleagues. Jacoby, Speller, and Kohn (1974) manipulated the number of attributes per brand (2, 4, 6) and the number of brands (4, 8, 12) to study the effects of information load on brand choice. Seventeen subjects were assigned to each of the nine cells, but only the 50 subjects who chose the best brand (out of the total of 153) were analyzed. A chi square test of the relationship between the two variables was statistically significant (X2 = 15.294; p < .005). Estimation of an effect size from the data would have revealed an association (w = .55) slightly greater than what Cohen (1977) defines as a large effect. The power to detect a large (w = .50) effect by a chi square test with N = 50, 4 d.f., and a Type I error rate of 5% equals 82%; power to detect a medium (w = .30) effect with that design equals 36%. A modified replication of this experiment (Jacoby, Speller, and Kohn 1974) used the same product but different levels of both number of attributes per brand (4, 8, 12, 16) and number of brands (4, 8, 12, 16). However, the sample per cell in this replication was decreased from 17 to 12. With a lower usable sample size (n = 45) and greater degrees of freedom (9 d.f.) than the first experiment, this second design had power of only 63% to detect the large effect that might have been anticipated from the previous study and power of only 23% to detect a medium effect size. Jacoby, Speller, and Kohn's results in the replication indicated no significant effect (X2 = 13.76, 9 d.f., p < .20). Lack of significance was attributed to the different levels of the independent variables. However, regardless of any differences in design or other problems, the lack of statistical significance might have been due to the lower power.
Moreover, an effect size measure would have revealed that the estimate of effect size in the second experiment was just as large as that in the first. Effect size as estimated by w equaled exactly .55 in both experiments. Even though w values cannot be compared exactly across research studies because w is not a measure of proportion of explained variance and can be expected to be larger for chi square analyses with greater degrees of freedom, the obtained effect size in the second experiment is reasonably large. Calculation of obtained effect size might have convinced Jacoby and his colleagues not to dismiss the effect in the second experiment as insignificant and unworthy of further attention.
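
The equality of the two obtained effect sizes is easy to verify from the reported statistics, since Cohen's w for a chi-square test is simply the square root of chi-square divided by N:

```python
from math import sqrt

def cohens_w(chi2, n):
    """Cohen's w effect size index for a chi-square test:
    w = sqrt(chi2 / N), where N is the number of cases analyzed."""
    return sqrt(chi2 / n)

# Reported statistics from the two information-load experiments:
w_first = cohens_w(15.294, 50)   # first experiment, N = 50 usable subjects
w_second = cohens_w(13.76, 45)   # replication, N = 45 usable subjects
```

Both values round to .55, confirming that the "insignificant" replication produced as large an obtained effect as the significant first experiment.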

The two research studies by Jacoby and his colleagues illustrate how the use of the appropriate statistical test can improve power and sometimes even alter the statistical conclusion. The chi square statistic was used to test the significance of the obtained results, but those data were analyzed in an unusual fashion. The chi square tables analyzed number of brands by number of attributes per brand matrices for only those subjects who chose the "best" brand. Thus only 50 of 153 total subjects were used in the statistical analysis in the first experiment and only 45 of 192 were used in the second. Any analysis that disregards some subjects is suspect, especially when the omitted subjects amount to more than two-thirds of the total. Such deletion obviously hinders statistical power. If instead of analyzing subjects who chose the best brand the authors had analyzed those who did not choose the best brand, statistical power would have been higher because of the greater number of subjects. However, such a decision would have been as arbitrary and incorrect as the other.

An analysis that is both more appropriate to the research question and higher in statistical power is a three-dimensional contingency table analysis (Winer, 1971, p. 855-9). Such an analysis can utilize all the subjects and can test for effects of number of brands by choice, number of attributes by choice, and brands by attributes by choice. Table 1 reveals that this reanalysis of Jacoby's first experiment finds statistically significant effects of brands by choice and brands by attributes by choice. The exact hypotheses in these two exploratory studies were not stated. However, the implicit conclusion in the first experiment that chi square indicates that both the number of brands and the number of attributes are significantly related to best brand choice (p. 65) is misleading. Only the latter term is significant in the reanalysis of the data. Perhaps even more important is the fact that, contrary to Jacoby's less powerful statistical analysis, the reanalysis of the second experiment found statistically significant effects of both brands by choice and brands by attributes by choice along with a larger effect size than in the first experiment. From the previous discussion, remember that the statistical power of the published analysis of the second experiment is lower than the power of the first. This lower power may have misled the authors to conclude that there were no statistically significant effects. In the reanalysis in Table 1 the power to detect a medium (w = .30) effect is 76% for both experiments.

TABLE 1

THREE-DIMENSIONAL CONTINGENCY TABLE ANALYSES OF TWO INFORMATION LOAD EXPERIMENTS

HIGH STATISTICAL POWER AND EFFECT SIZE

Very high power highlights the important limitations of any statistical test of nonzero effects (see Meehl, 1967). Statistical significance for a given Type I error rate and effect size is merely a function of the sample size. An effect of even a very small size will almost certainly be statistically significant with a sufficiently large sample, but a relatively large effect may not be judged statistically significant with a small sample. Although statistical significance tests are necessary to protect against Type I errors, attained significance levels should never be regarded as a measure of the magnitude of an effect (Bakan, 1966). Instead, additional descriptive statistics about the extent and form of an effect should be reported. Although several researchers (e.g., Green, 1973; Rosekrans, 1969) have advocated the use of proportion of explained variance measures, the reporting of such measures is still not common with statistics other than regression/correlation analysis. The aim of these advocates is to turn the focus of attention away from statistical significance to effect magnitude--which, unlike the former, is not determined by sample size.
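
One widely used proportion-of-explained-variance measure for fixed-effects ANOVA is Hays' omega-squared. The following sketch, with purely illustrative (hypothetical) input values, shows the standard computation from the ANOVA summary quantities:

```python
def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Hays' omega-squared: the estimated proportion of population
    variance in the dependent variable accounted for by an effect in
    a fixed-effects ANOVA.
    omega^2 = (SS_effect - df_effect * MS_error) / (SS_total + MS_error)"""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Hypothetical ANOVA summary values, for illustration only:
pv = omega_squared(ss_effect=20.0, df_effect=1, ms_error=2.0, ss_total=100.0)
```

Unlike an attained significance level, this quantity describes the magnitude of the effect and is not driven upward or downward by sample size alone.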

As Cohen (1977, p. 78) states in advocating the use of effect size estimates to complement traditional statistical significance tests' descriptions of a research study's effects, "the only difficulty arising from the use of PV (proportion of explained variance) measures lies in the fact that in many, perhaps most, of the areas of behavioral science, they turn out to be so small!" [Discussion about the problems of effects size estimates can be found in Sechrest and Yeaton (1981a,b), Latour (1981), and Sawyer and Ball (1981).]

Some researchers are appropriately using proportion of explained variance measures such as ω2 in nonregression studies. For example, Belk (1974) and Lutz and Kakkar (1975) used very powerful within-subject analysis of variance designs to study situational effects. The former study analyzed 10,000 observations and the latter had 3,150. Because power to detect even small main effects was no lower than 98%, statistical significance tests were of minor interest and the major focus was the size of the independent variables' proportion of explained variance as measured by ω2.

Golden's (1979) study of comparative advertising illustrates how a reliance on statistical significance and failure to report effect size might be misleading. An admirable improvement of her design over past studies of comparative advertising was the greater statistical power. The use of 594 subjects and an analysis of covariance which controlled for the effects of brand loyalty led to power much higher than that in most consumer research experiments. The power to detect even a small (f = .10) effect ranged from 40% to 67% (depending on the number of degrees of freedom in the numerator of the particular F-test). This high power undoubtedly helped Golden to find six of the tested effects significant at p < .05. The primary findings of Golden's research might have been that, given the high power, there were few statistically significant effects and--more important--that the few significant effects were so small. However, Golden did not report the size of statistically significant effects. Measurement and reporting of effect size would have enabled a reader to understand that the only significant analysis of variance term involving comparative advertising (the interactive effect of comparative ad by copy theme on purchase intention) explained only 0.7% of the total variance.

Many surveys systematically vary factors that may affect the response rate. Such research is usually quite high in power due to the high sample size typically employed in surveys. Given this high power, the multitude of potential factors to increase survey response rates and the need for some theory to conceptualize this research area (Houston and Ford, 1976), research in this area ought not limit description of results to statistical significance but should also describe effect size. For example, Childers and Farrell (1979) varied both the page size and number of sheets of a mail questionnaire sent to 440 people and found that smaller page size resulted in a significantly (p = .033) greater response rate. Calculation of ω2 would have shown that, despite the statistical significance, this variable explained only 0.8% of the total variance. Although point estimates of response rates may be a more informative way to describe effect size, proportion of explained variance measures will be an especially helpful expression of effect size when other indices of response quality (e.g., Hansen and Scott, 1978) are examined.

It should be emphasized that highly powered research studies such as those of Golden or Childers and Farrell should be applauded, not criticized. Nor should it be inferred that small effect sizes are unimportant. The important point is that, because highly powered research is often able to detect small effect sizes as statistically significant, researchers ought to describe the sizes of obtained effects in addition to the statistical significance.

If the power of a contemplated research design appears to be very high, sample sizes might be reduced to save time and cost. If a very large effect size is anticipated, the researcher should consider whether the large effect is so obviously present that an alternative, less obviously true but more meaningful null hypothesis involving a smaller effect size ought to be tested. For example, rather than simply determining that a large effect is different from zero, one could more usefully determine whether the effect is at least as large as some minimum value predicted by another model or theory (see Armstrong, 1979; Morrison and Henkel, 1970; Platt, 1964). For example, modelers of consumer behavior usually do not merely test whether a given model predicts better than random events. Instead, new models are compared with others that have previously been shown to have good predictive ability (e.g., Givon and Horsky, 1979). In addition to leading to a more useful null hypothesis, comparing two or more imbedded models usually alleviates the problem of large sample size because the focus is on differences between the test statistics (such as X2) rather than the absolute levels of the test statistic. In addition, effect size estimates such as the contingency coefficient for chi square analyses should be reported to describe both the absolute and incremental fits of the tested models (e.g., Srinivasan and Kesavan, 1976).

Comparison of competing models is especially advised in analysis of linear structural equation models with latent variables (Bagozzi, 1977). If a proposed model is instead analyzed in isolation from alternative models using, for example, Joreskog's LISREL, that model is "proved"--unlike with most statistical tests--by accepting the null hypothesis of no difference between the variance-covariance matrices implied by the proposed model and the sample data. Thus, large sample sizes are needed to provide properly conservative Type I error rates, yet with very large samples nearly all models will be rejected. This conflict can be considerably alleviated if recently developed measures of effect size (Bentler and Bonett, 1980; Fornell and Larcker, 1981) are used to augment the chi-square tests of significance. For example, Phillips and Bagozzi (1980) analyzed a structural equation model that was rejected by the chi-square statistical test. However, this statistical rejection was almost entirely due to the quite high sample size (n = 1531) since smaller subsamples did not statistically reject the model, the absolute residuals between the two variance-covariance matrices were small, and Bentler and Bonett's goodness-of-fit index (calculated to be .967) indicated that there was only a trivial difference between the hypothesized model and the actual data.
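
Bentler and Bonett's normed fit index is simple to compute from the chi-square statistics of the hypothesized model and of the null (independence) model. The chi-square values below are hypothetical, chosen only to illustrate an index near the .967 that Phillips and Bagozzi reported; their actual chi-square statistics are not given here.

```python
def normed_fit_index(chi2_null, chi2_model):
    """Bentler and Bonett's (1980) normed fit index: the proportional
    improvement in fit of the hypothesized model over the null
    (independence) model.  Values near 1 indicate a trivial discrepancy
    between model and data regardless of sample size."""
    return (chi2_null - chi2_model) / chi2_null

# Hypothetical chi-square values for illustration only:
nfi = normed_fit_index(chi2_null=1000.0, chi2_model=33.0)
```

Because both chi-square statistics grow with sample size while their ratio does not, the index is a sample-size-free description of fit that can rescue a model "rejected" only because n is very large.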

HOW TO IMPROVE STATISTICAL POWER

Fortunately, there are many ways to increase statistical power in consumer research. The most obvious mode is to increase sample size. Although larger sample sizes will increase statistical power, the amount of the increase depends on the effect size in question. Even if resources for increased sample size are available, it is usually desirable to look first at alternative methods. Cohen (1973) argues that theoretical researchers ought to concentrate on effect size. After becoming aware of the (often very low) magnitude of effect size, a researcher often can most efficiently increase power by "developing insights which lead to research procedures and instruments which make effects measurably large enough to be detected by experiments of reasonable size.... (A researcher) must apply himself toward making (effect size) bigger rather than passively wanting to detect it regardless of how small the effect is and needlessly 'strive for significance'" (p. 228-9).

Methodological improvements in all areas can increase detectable effect size and, hence, power. Measurement and treatment reliability can have large effects on power. Boruch and Gomez (1977) demonstrate that a research design with perfect reliability and a treatment implemented with no errors had 92% statistical power, but the power was reduced to only 54% when reliability was reduced to .80 and the implemented treatment overlapped only 75% with the treatment as conceptualized. Measures less biased by uncontrollable differences can also add to power. For example, Eskin and Baron (1977) used the ratio of sales per store to the average sales per store to control for store size, and Chevalier (1975) employed sales per 1000 customers to control for weekly store traffic fluctuations.
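
The mechanism behind the Boruch and Gomez figures can be sketched as follows. This is a simplified illustration of their argument, not their exact model: unreliable measurement attenuates the detectable effect by the square root of the reliability, and imperfect treatment implementation shrinks it further by the implemented fraction; the attenuated effect is then run through an ordinary (normal-approximation) power calculation. All numeric inputs in the usage example are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def attenuated_power(d_true, reliability, integrity, n_per_cell,
                     z_crit=1.95996):
    """Two-tailed (5%) normal-approximation power of a two-sample test
    when measurement unreliability shrinks the true effect by
    sqrt(reliability) and imperfect treatment implementation shrinks
    it further by the fraction 'integrity'.  A simplified sketch of
    the mechanism Boruch and Gomez (1977) analyze."""
    d_obs = d_true * sqrt(reliability) * integrity
    return normal_cdf(d_obs * sqrt(n_per_cell / 2.0) - z_crit)

# Hypothetical design: a large true effect with 26 subjects per cell.
p_perfect = attenuated_power(0.8, reliability=1.0, integrity=1.0, n_per_cell=26)
p_degraded = attenuated_power(0.8, reliability=0.8, integrity=0.75, n_per_cell=26)
```

With these illustrative numbers, power falls from roughly .82 under perfect reliability and implementation to roughly .49 under .80 reliability and 75% implementation, echoing the direction (though not the exact figures) of the Boruch and Gomez demonstration.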

Research design alterations can also improve power. Within-subject designs and analysis of covariance are very likely to reduce unexplained error (see Doyle and Fenwick, 1975; Greenwald, 1976). Another way to improve power is to increase the homogeneity of the sample by blocking or matching designs (Day and Heeler, 1971; Lodish and Pekelman, 1978). Overall and Dalal (1965) show how to choose among alternative designs if prior estimates of mean squares of different experimental treatments are available or can be estimated. For example, the decision among design alternatives of increases in either the number of test market territories, retail stores audited per test area, or repeated sales audits per store to increase measure reliability could be made by choosing the most powerful combination of these alternatives that does not exceed a given research budget.

Stronger and better controlled manipulations of independent variables also can improve power via increased effect size. In theoretical investigations, one ought to try initially to provide wide variations in the independent variables and only after finding statistically significant effects with the large manipulations, try to assess the effects of more subtle differences. This practice is also advisable in applied field research because even treatments intended to have a large impact may not be fully implemented and thus may result in a much smaller than anticipated effect. Boruch and Gomez (1977) demonstrated the adverse effects of "structural imperfections" (p. 424) which was their term for poorly administered treatments. When combined with the reduced measurement and treatment reliability described above, a 75% implementation of an experimental treatment further reduced statistical power from 54% to 38%.

Finally, statistical procedures can affect power. Advocates of Bayesian statistics might assert that replacing the classical inference approach would be an appropriate way to alleviate statistical power problems. Bayesian statistics do not force a reliance on a yes-no rejection of the null hypothesis and do not have to assume implicitly a high prior probability that the null hypothesis is true (see Phillips, 1973). Bayesian statistics also allow much more flexible statements about the probability that the null hypothesis is true.

Greenwald has demonstrated the advantages of Bayesian hypothesis testing over the classical approach. In one study that concluded there was no effect of an independent variable because of a statistically insignificant result, Greenwald (1975b) showed that a Bayesian reanalysis of the data put the odds in favor of the research hypothesis at nearly 5:1. A second paper presented two studies which both failed to reject the null hypothesis. Greenwald's (1975c) reanalysis used a flat prior probability distribution for the first experiment and the resulting posterior as the prior for the second experiment. The results indicated odds of more than 23:1 in favor of the null hypothesis over an alternative of a minimum effect size for one independent variable, and almost 8:1 for a second. A third example (Greenwald, 1975a) calculated the odds in favor of the null hypothesis as high as 249:1.
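The sequential logic of Greenwald's reanalysis can be mimicked with point hypotheses: starting from flat (1:1) prior odds, each study's likelihood ratio multiplies the running odds, so the posterior after study one serves as the prior for study two. The sketch below is a simplified illustration, not Greenwald's actual analysis; the observed z values and the minimum-effect alternative (delta = 2.8) are hypothetical.

```python
from statistics import NormalDist

def bayes_factor_null(z_obs, delta):
    """Likelihood ratio p(z | H0) / p(z | H1) for a point null N(0, 1)
    against a point alternative N(delta, 1)."""
    return NormalDist(0, 1).pdf(z_obs) / NormalDist(delta, 1).pdf(z_obs)

# Two hypothetical studies, both 'nonsignificant' (|z| < 1.96),
# tested against a minimum-effect alternative with delta = 2.8
odds = 1.0                       # flat prior odds of 1:1
for z in (0.3, -0.2):
    odds *= bayes_factor_null(z, delta=2.8)
print(round(odds, 1))            # posterior odds favoring the null
```

The key point is that two individually inconclusive results can, under a Bayesian analysis, yield strong cumulative evidence for the null hypothesis against a specific alternative.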

More development of statistical procedures that are either more amenable to low sample sizes or more powerful is needed (Winer and Ryan, 1979). Similarly, simulations of the power of statistics for which power distributions are not known (e.g., Blattberg and Sen, 1973), and of the relative power of alternative research strategies (e.g., Srinivasan, 1977), would be very helpful. In addition, as LaTour's paper in this session demonstrates, improved measures of effect size need to be developed.

A final statistical method to increase power is to combine several studies in a type of "meta-analysis." Keppel (1973) shows how analysis of several replications--each of which finds directional but insignificant support for a research hypothesis--results in a rejection of the null hypothesis when the replications are combined into a replications X effects ANOVA. Several alternative methods of combining replications are available (Rosenthal, 1978). As with single studies, it is wise to examine effect size as well as statistical significance. Hyde's (1981) study demonstrates how "it is possible to have a moderately reliable psychological phenomenon (e.g., one that appears in 50% or more of published papers on the topic) that is nonetheless small" (p. 900). Effects cited as very consistent by reviewers were calculated by Hyde to account for only 1% of the variance.
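One of the combination procedures Rosenthal (1978) reviews is Stouffer's method: convert each replication's one-tailed p value to a z score, sum the z's, and divide by the square root of the number of studies. A minimal sketch, with hypothetical p values chosen so that each replication is directional but nonsignificant on its own:

```python
from math import sqrt
from statistics import NormalDist

def stouffer_combined_p(p_values):
    """Combine one-tailed p values from independent replications
    via Stouffer's method."""
    nd = NormalDist()
    zs = [nd.inv_cdf(1 - p) for p in p_values]
    z_combined = sum(zs) / sqrt(len(zs))
    return 1 - nd.cdf(z_combined)

# Three hypothetical replications, each nonsignificant at .05
print(round(stouffer_combined_p([0.10, 0.08, 0.12]), 3))
```

The combined p value falls below .05 even though no single replication reaches it, which is exactly Keppel's point about pooling replications.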

CONCLUSION

Two basic aspects of statistical conclusion validity, statistical power and effect size, have not received sufficient attention in consumer research. The goal of this paper is to remind consumer researchers about the problems that may be encountered if statistical power and effect size are ignored. Greater attention to statistical power and effect size can only improve research in consumer behavior and the conclusions from that research.

Statistical power should be estimated before data collection, and the power to detect an effect of a given size should be included in any report of results. If the null hypothesis is not rejected, reports of the power to detect an effect of a given size would enable the reader to judge the adequacy of the research to support the alternative hypothesis. Perhaps the most direct outcome of a heightened concern for statistical power is a realization that increases in sample sizes may warrant the greater cost and effort. Benefits of increased sample size include a greater likelihood of correctly rejecting a false null hypothesis and more accurate estimation of effect size.
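Prospective power estimation is straightforward once the researcher commits to a smallest effect size worth detecting. The sketch below inverts the usual normal-approximation power formula to give the sample size per group needed for a two-tailed, two-sample test; the target of 80% power and the three effect sizes (Cohen's conventional small, medium, and large values) are illustrative choices.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Smallest n per group giving the target power for a two-tailed
    two-sample test of standardized effect size d (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):   # Cohen's small, medium, and large effects
    print(d, n_per_group(d))
```

The steep rise in required n as the anticipated effect shrinks is precisely why a prior estimate of effect size is indispensable for planning.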

However, a more subtle and perhaps more important implication of a concern with statistical power is an intensified focus on effect size. Both prior and posterior estimation of effect size should become more prevalent in marketing research. Estimates of effect size are especially important in highly powered research studies where even small effects can be statistically significant. This practice should become the norm. Literature reviews might concentrate on determining factors that led to a given effect size range. Theoretical and methodological improvements to increase likely effect size are the ideal outgrowth of an increased concern with statistical power.
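As one concrete posterior estimate, omega-squared can be computed directly from a reported one-way ANOVA F statistic, the number of groups, and the total sample size (Hays, 1963). The figures below are hypothetical but illustrate Hyde's point: a clearly significant F in a large sample can correspond to only about 1% of variance explained.

```python
def omega_squared(f, k, n_total):
    """Estimate omega-squared from a one-way ANOVA F statistic,
    k groups, and total sample size n_total."""
    num = (k - 1) * (f - 1)
    return max(num / (num + n_total), 0.0)  # truncate negative estimates to 0

# A 'highly significant' F from a large two-group study is still a small effect
print(round(omega_squared(f=6.0, k=2, n_total=500), 3))  # about 1% of variance
```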

REFERENCES

Armstrong, J. Scott (May 1979), "Advocacy and Objectivity in Science," Management Science, 25, pp. 423-438.

Bagozzi, Richard P. (May. 1977), "Structural Equation Models in Experimental Research," Journal of Marketing Research, 14, pp. 209-226.

Bakan, David (December 1966), "The Test of Significance in Psychological Research," Psychological Bulletin, 66, pp. 423-437.

Banks, Seymour (1965), Experimentation in Marketing (New York: McGraw-Hill Book Company).

Blattberg, Robert C., and Sen, Subrata R. (November 1973), "An Evaluation of the Application of Minimum Chi-Square Procedures to Stochastic Models of Brand Choice," Journal of Marketing Research, 10, pp. 421-427.

Boruch, Robert F., and Gomez, Hernando (November 1977), "Sensitivity, Bias and Theory in Impact Evaluations," Professional Psychology, pp. 411-434.

Carlsmith, J. Merrill, Ellsworth, Phoebe C., and Aronson, Elliot (1976), Methods of Research in Social Psychology (Reading, MA: Addison-Wesley).

Chevalier, Michel (November 1975), "Increase in Sales Due to In-Store Display," Journal of Marketing Research, 12, pp. 426-431.

Churchill, Gilbert A. (February 1979), "A Paradigm for Developing Better Measures of Marketing Constructs," Journal of Marketing Research, 16, pp. 64-73.

Cohen, Jacob (1965), "Some Statistical Issues in Psychological Research," in Handbook of Clinical Psychology, B. B. Wolman, (ed.), (New York: McGraw-Hill Book Company), pp. 95-121.

Cohen, Jacob (Summer 1973), "Statistical Power Analysis and Research Results," American Educational Research Journal, 10, pp. 225-229.

Cohen, Jacob (1977), Statistical Power Analysis for the Behavioral Sciences, (New York: Academic Press).

Cook, Thomas D., and Campbell, Donald T. (1979), Quasi-Experimentation: Design and Analysis Issues for Field Settings (Chicago: Rand-McNally).

Cook, Thomas D., Gruder, Charles R., Hennigan, Karen M., and Flay, Brian R. (1978), "History of the Sleeper Effect: Some Logical Pitfalls in Accepting the Null Hypothesis," Psychological Bulletin, 86, pp. 662-679.

Day, George S., and Heeler, Roger M. (August 1971), "Using Cluster Analysis to Improve Marketing Experiments," Journal of Marketing Research, 8, pp. 340-347.

Doyle, Peter, and Fenwick, Ian (February 1975), "Planning and Estimation in Advertising," Journal of Marketing Research, 12, pp. 1-6.

Eskin, Gerald J., and Baron, Penny H. (November 1977), "Effects of Price and Advertising in Test-Market Experiments," Journal of Marketing Research, 14, pp.499-508.

Fornell, Claes, and Larcker, David F. (February 1981), "Evaluating Structural Equation Models and Unobservable Variables and Measurement Error," Journal of Marketing Research, 18, pp. 39-50.

Givon, Moshe M., and Horsky, Dan (May 1979), "Application of a Composite Stochastic Model of Brand Choice," Journal of Marketing Research, 16, pp. 258-267.

Glass, Gene V., and Hakstian, A. Ralph (May 1969), "Measures of Association in Comparative Experiments: Their Development and Interpretation," American Educational Research Journal, 6, pp. 403-414.

Golden, Linda L. (November 1979), "Consumer Reactions to Explicit Brand Comparisons of Advertisements," Journal of Marketing Research, 16, pp. 517-532.

Green, Paul E. (November 1973), "On the Analysis of Interactions in Marketing Research Data," Journal of Marketing Research, 10, pp. 410-420.

Greenwald, Anthony G. (January 1975a), "Consequences of Prejudice Against the Null Hypothesis," Psychological Bulletin, 82, pp. 1-19.

Greenwald, Anthony G. (October 1975b), "Does the Good Samaritan Parable Increase Helping? A Comment on Darley and Batson's No-Effect Conclusion," Journal of Personality and Social Psychology, 32, pp. 578-583.

Greenwald, Anthony G. (March 1975c), "Significance, Non-significance, and Interpretation of an ESP Experiment," Journal of Experimental Social Psychology, 11, pp. 180-191.

Greenwald, Anthony G. (March 1976), "Within-Subject Designs: To Use or Not to Use?" Psychological Bulletin, 83, pp. 314-320.

Gruder, Charles L., Cook, Thomas, Hennigan, Karen, Flay, Brian, Alessis, Cynthia, and Halamaj, Jerome (October 1978), "Empirical Tests of the Absolute Sleeper Effect Predicted from the Discounting Cue Hypothesis," Journal of Personality and Social Psychology, 36, pp. 1061-1074.

Hawkins, Del (August 1970), "The Effects of Subliminal Stimulation on Drive Level and Brand Preference." Journal of Marketing Research, 7, pp. 322-326.

Hays, William L. (1963), Statistics for Psychologists, (New York: Holt, Rinehart, and Winston).

Hyde, Janet S. (August 1981), "How Large are Cognitive Gender Differences?: A Meta-Analysis Using ω² and d," American Psychologist, 36, pp. 892-901.

Jacoby, Jacob, Speller, Donald E., and Kohn, Carol A. (Feb. 1974a), "Brand Choice Behavior as a Function of Information Load," Journal of Marketing Research, 11, pp. 63-69.

Jacoby, Jacob (June 1974b), "Brand Choice Behavior as a Function of Information Load," Journal of Consumer Research, 1. pp.33-42.

Keppel, Geoffrey (1973), Design and Analysis: A Researcher's Handbook (Englewood Cliffs, NJ: Prentice-Hall, Inc.).

LaTour, Stephen A. (January 1981), "Effect Size Estimation: A Commentary on Wolf and Bassler," Decision Sciences, 12, pp. 136-141.

Lodish, Leonard, and Pekelman, Dov (August 1978), "Increasing Precision of Marketing Experiments by Matching Sales Areas," Journal of Marketing Research, 15, pp. 449-455.

Lutz, Richard J., and Kakkar, Pradeep (1975), "The Psychological Situation as a Determinant of Consumer Behavior," in Advances in Consumer Research, Vol. 2, Mary J. Schlinger, (ed.), (Chicago: Association for Consumer Research), pp. 439-453.

McAlister, Leigh (December 1979), "Choosing Multiple Items from a Product Class," Journal of Consumer Research, 6, pp. 213-224.

Meehl, Paul E. (June 1967), "Theory Testing in Psychology and Physics: A Methodological Paradox," Philosophy of Science, 34, pp. 103-115.

Morrison, Denton E., and Henkel, Ramon E. (eds.) (1970), The Significance Test Controversy (Chicago: Aldine).

Permut, Steven E., Michel, Allen J., and Joseph, Monica (August 1976), "The Researcher's Sample: A Review of the Choice of Respondents in Marketing Research," Journal of Marketing Research, 13, pp. 278-283.

Peter, Paul J. (May 1981), "Construct Validity: A Review of Basic Issues and Marketing Research," Journal of Marketing Research, 18.

Phillips, Lawrence D. (1973), Bayesian Statistics for Social Sciences (London: Thomas Nelson).

Phillips, Lynn W., and Bagozzi, Richard P. (1980), "On Measuring Organizational Properties: Methodological Issues in the Use of Key Informants," Unpublished Working Paper (Stanford University).

Platt, John R. (October 16, 1964), "Strong Inference," Science, 146, pp. 347-353.

Rosekrans, Frank M. (November 1969), "Statistical Significance and Reporting Test Results," Journal of Marketing Research, 6, pp. 451-455.

Rosenthal, Robert (December 1978), "Combining Results of Independent Studies," Psychological Bulletin, 85, pp. 185-193.

Rozeboom, William W. (September 1960), "The Fallacy of the Null Hypothesis Significance Test," Psychological Bulletin, 57, pp. 416-428.

Sawyer, Alan G. (March 1975), "Demand Artifacts in Laboratory Experiments in Consumer Research," Journal of Consumer Research, 1, pp. 20-30.

Sawyer, Alan G. and Ball, Dwayne (August 1981), "Statistical Power and Effect Size in Marketing Research," Journal of Marketing Research, 18, pp. 275-290.

Sawyer, Alan G., Worthing, Parker, and Sendak, Paul (Summer 1979), "The Role of Laboratory Experiments to Test Marketing Strategies," Journal of Marketing, 43, pp. 60-67.

Sechrest, Lee, and Yeaton, William H. (1981a), "Empirical Bases for Estimating Effect Size," in Reanalyzing Program Evaluation: Policies and Practices for Secondary Analysis of Social and Educational Programs, R. F. Boruch, P. M. Wortman, and D. S. Cordray, (eds.) (San Francisco: Jossey-Bass).

Sechrest, Lee and Yeaton, William H. (1981b), "Estimating Magnitudes of Experimental Effects," (Ann Arbor: Institute of Social Research, University of Michigan).

Shoemaker, Robert, and Staelin, Richard (May 1976), "The Effects of Sampling Variation on Sales Forecasts for New Consumer Product," Journal of Marketing Research, 13, pp. 138-143.

Srinivasan, V. (February 1977), "A Theoretical Comparison of the Predictive Power of the Multiple Regression and Equal Weighting Procedures," Research Paper No. 347 (Stanford University).

Srinivasan, V. and Kesavan, R. (September 1976), "An Alternate Interpretation of the Linear Learning Model of Brand Choice," Journal of Consumer Research, 3, pp. 76-83.

Sternthal, Brian, Dholakia, Ruby, and Leavitt, Clark (May 1978), "The Persuasive Effect of Source Credibility: Tests of Cognitive Response," Journal of Consumer Research, 4, pp. 252-260.

Sternthal, Brian, Phillips, Lynn W. and Dholakia, Ruby (Fall 1978), "The Persuasive Effect of Source Credibility: A Situational Analysis," Public Opinion Quarterly, 42, pp. 285-314.

Tucker, Ledyard R., and Lewis, Charles (March 1973), "A Reliability Coefficient for Maximum Likelihood Factor Analysis," Psychometrika, 38, pp. 1-10.

Tversky, Amos, and Kahneman, Daniel (August 1971), "Belief in the Law of Small Numbers," Psychological Bulletin, 76, pp. 105-110.

Wheatley, John J. and Oshikawa, Sadaomi (February 1970), "The Relationship Between Anxiety and Positive and Negative Advertising Appeals," Journal of Marketing Research, 7, pp. 85-89.

Winer, B. J. (1971), Statistical Principles in Experimental Design (New York: McGraw-Hill Book Company).

Winer, Russell S., and Ryan, Michael J. (November 1979), "Analyzing Cross-Classification Data: An Improved Method for Predicting Events," Journal of Marketing Research, 16, pp. 539-544.
