Research Methodology and Advertising Substantiation

Robert A. Mittelstaedt, University of Nebraska
Nils-Erik Aaby (student), University of Nebraska
ABSTRACT - As more consumer research is used to substantiate advertising claims in FTC proceedings, methodological questions become important. Full utilization of research results depends on correctly interpreting the errors (Type I and Type II) inherent in accepting or rejecting hypotheses and some estimate of the relative costs associated with those errors.
Citation: Robert A. Mittelstaedt and Nils-Erik Aaby (1976), "Research Methodology and Advertising Substantiation," in Advances in Consumer Research, Volume 3, ed. Beverlee B. Anderson, Cincinnati, OH: Association for Consumer Research, 58-62.





The use of research results to substantiate advertising claims which have been challenged as deceptive is not new. However, the Federal Trade Commission has been formalizing this process through its advertising substantiation program and the relationship between research and product claims has taken on several new aspects.

First, many more claims are being scrutinized and the frequency of calls for substantiation has increased. Second, under the new "streamlined and tightened" rules of January, 1974, the response time has been shortened from 60 to 30 days (FTC News Summary, 1974a) with a proposal to exclude from adjudicative proceedings any material arriving after the deadline (FTC News Summary, 1974b). No doubt this reflects recent Commission decisions that research supporting a claim must be performed prior to the making of a claim (Pfizer, 1972).

Third, the type of research that would be acceptable to support a given claim appears to have been broadened. In 1972 the Commission defined a "scientific test" as follows:

In our view a scientific test is one in which persons with skill and expertise in the field conduct the test and evaluate its results in a disinterested manner using testing procedures generally accepted in the profession which best ensure accurate results. This is not to say that respondent always must conduct laboratory tests. The appropriate test depends on the nature of the claim made. Thus a road or user test may be an adequate scientific test to substantiate one performance claim, whereas a laboratory test may be the proper test to substantiate another claim. Respondent's obligation is to assure that any claim it makes is adequately substantiated by the results of whatever constitutes a scientific test in those circumstances (Firestone, 1972, 463).

More recently, the Commission appears to have put the "hearsay issue," which has long been a problem when using consumer research as legal evidence, to rest:

Turning to the test marketing reports in this record, we must dismiss any contention that the FTC is bound to reject these consumer surveys as inadmissible hearsay. The Commission has on numerous occasions considered the question of the admissibility of surveys which are obviously hearsay, and it is well settled that such surveys will be admitted for the truth of the matters asserted when it is demonstrated that they are reasonably reliable and probative (Bristol-Myers, 1975).

Further, the Commission has used "academic" research as an input into a decision regarding a proposed ban on premium ads on children's television programs (Advertising Age, 1975).

Finally, while the Commission has apparently broadened its definition of scientific evidence, it has given considerably more attention to methodological details (Alberto-Culver, 1973; Gillette, 1973). For example, the Commission ordered the Gillette Company to substantiate a claim that a deodorant had "A light, clean scent. Not perfumey or chemical. A scent that comes from real, natural ingredients. Not a lot of artificial ones" (Gillette, 1973) by documentation that was to include, but not be limited to (Gillette, 1973):

A. A definition of such terms as "light," "clean," etc.

B. Survey or experimental evidence that a substantial majority of individuals perceived Right Guard as light, clean, etc. to include:

1. complete explanation of methodology.

2. description of subjects, method of recruitment, size and generalizability of sample.

3. experience and training of experimenters.

4. list and descriptions of products compared.

5. survey respondents' knowledge of purpose.

6. statistical designs and techniques of statistical inference including assumptions necessary for tests of significance and specification of alpha level.

7. data necessary for replication.

8. copies of all questionnaires.

9. indication of reliability and validity of all instruments used.

C. List of ingredients of product, by percentage, and their function.

All of this suggests that in cases involving the substantiation of advertising claims, research of the sort familiar to consumer researchers will play an increasingly important role. It further suggests that the Commission, as it seeks to specify certain methodological matters, will be less able to reject research evidence as "not germane" and the resulting evidentiary issues will increasingly turn on methodological questions. The purpose of this paper is to discuss some of the logical, methodological and legal issues involved in using research as evidence in cases involving the substantiation of advertising claims.


Assume Firm "X" wishes to claim that its product "A" is better than "B." In this example, "B" may be a previous product, a competitive product or some established standard. Regardless of the referent, the claim is essentially comparative and presumably verified by showing "A" to be "better than 'B'" along some measurable dimension(s). If Firm "X" makes such a claim in its advertising, the present interpretation of the substantiation program is that "X" must have some basis for the claim, presumably research results. If called upon to substantiate, the respondent ("X") submits its research and the Commission decides whether or not it supports the claim.

At this point, there are at least two problems, one conceptual and the other legal, which will be noted but not discussed. First, one may question whether the truth of a specific claim is really the central issue in determining "deception," an issue discussed elsewhere (Developments, 1967; Gardner, 1975). Second, the whole procedure may be viewed as shifting the burden of proof to the advertiser, which some authorities view as the "fatal flaw" in the entire ad substantiation program (Gellerhorn, 1969, 1971; Rosden, 1974). However, the Commission is requiring research results from advertisers and using them to determine the existence of deception.

Whether an advertisement is analyzed from the standpoint of unfairness or deception, however, the standard for evaluating the substantiating material and the test which is applied is the same -- does the substantiation provide a reasonable basis to support the claim. Essentially, this is a factual issue to be formulated in the context of circumstances present in each case (National Dynamics, 1973, 549).

Given this, two general areas, measurement and statistical inference, are discussed before attention is turned to the matter of research evidence in the larger context of the Commission's decision making process.


No doubt the dimensions along which "A" is supposedly better than "B" will determine the type of measurement used to support the claim. In a general sense, three situations seem applicable. First, the measurement may be "physical" or "mechanical," e.g., "A" tires stop faster than "B" or it takes greater pressure to break "A" than "B." Second, the measurements may consist of the observations of some "expert" (the typical clinical evidence), e.g., "A" produces fewer cavities or relieves some set of symptoms better than "B." Third, the perceptions, opinions or attitudes of consumers may be measured, e.g., "A" smells fresher or tastes better than "B."

While all of these measurements lead to some "numbers" and, presumably, to a "test statistic," there are a few differences among these three situations worth noting. In general, the problem of finding a measure which both Firm "X" and the FTC agree is pertinent to the claim is probably most acute in the third case (consumer research). The consumer research case is also most likely to involve any possible issues of "hearsay evidence" that may remain, in addition to the problem of concealing the identity of respondents and the potentially conflicting goals of "full disclosure" and "professional ethics." Furthermore, the nature of the data generated in consumer, and possibly clinical, research will be more likely to lend themselves to analysis by non-parametric methods which are generally less powerful. (The issue of "power" will be discussed later.) Finally, while not exactly a measurement issue, each of the cases described above has its own peculiar sampling problems of defining the relevant "universe" and actually achieving a random selection process.


Framed as a research question, the support for the claim that "'A' is better than 'B'" would most likely rest on a test capable of rejecting the null hypothesis that "'A' is not better than (or equal to) 'B'" in favor of the alternative. From the data, a test statistic is computed (t, F, Chi-square, etc.) and, if its magnitude indicates that the test results are sufficiently unlikely under the assumption of the null hypothesis, the null is rejected and the result is said to be significant. Of course, unlikely things do occur and there is a chance of wrongly rejecting the null hypothesis (Type I error) or, if non-significant, a chance of wrongly accepting the null (Type II error). Type I error is "controllable" by the researcher through the selection of the alpha level and "good practice" usually dictates that it is set in advance of the test and at a fairly low, conventional level. Controlling Type II error (beta) is also possible, but more difficult, since it can be computed only against a specified and definite alternative.

If the Commission accepts the results of the test as establishing substantiation when the null hypothesis is rejected (the results are significant) and failing to establish substantiation when the null is not rejected (the results are not significant), they assume the "risk burden" of each type of error in their decision. Table 1 shows this schematically. If the null hypothesis is rejected and the Commission concludes the claim is valid, the Commission runs the risk (p=alpha) of allowing a false claim to stay on the market. On the other hand, if the test is non-significant and the Commission concludes that the ad is unsubstantiated and must be removed, they risk (p=beta) taking a true claim off the market.
TABLE 1

RISKS ASSUMED BY THE COMMISSION IN RELYING ON A TEST RESULT

                              Claim is actually true       Claim is actually false
Null rejected                 Correct decision:            Type I error (p = alpha):
(claim substantiated)         true claim stands            false claim stands
Null not rejected             Type II error (p = beta):    Correct decision:
(claim not substantiated)     true claim removed           false claim removed

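The hypothesis-testing logic just outlined can be sketched with a small example. The survey figures below are invented, and an exact one-sided binomial test of a preference claim is only one of many procedures an advertiser might submit; it is not drawn from any of the cases discussed here:

```python
from math import comb

def binomial_pvalue(k, n, p0=0.5):
    """Exact one-sided p-value for H0: p <= p0 against H1: p > p0,
    i.e. P(X >= k successes in n trials | p = p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical survey: 65 of 100 consumers say "A" smells fresher than "B".
alpha = 0.05                      # set in advance, by convention
p = binomial_pvalue(65, 100)
substantiated = p < alpha         # rejecting H0 risks a Type I error (prob. alpha);
                                  # failing to reject risks a Type II error (prob. beta)
```

Whether such a test, its one-sided form, and its alpha level would satisfy the Commission is precisely the kind of methodological question the substantiation program raises.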
Therefore, using research findings as major inputs to the Commission's decision making in cases involving the substantiation of advertising claims raises at least two issues: the interpretation of statistical significance and the relative costs of Type I and Type II errors.


As anyone who has ever taught or taken a statistics course knows, the phrase "statistical significance" has a rather special meaning. As Hays puts it:

. . . all that a significant result implies is that one has observed something relatively unlikely given the hypothetical situation, but relatively more likely given some alternative situation .... Statistical significance is a statement about the likelihood of the observed result, nothing else. It does not guarantee that something important, or even meaningful, has happened (Hays, 1973, 384).

There appear to be few cases in which the Commission or the Courts have confronted the necessity of interpreting a "statistically significant result" or "non-significance," although some substantiation data appear to lack significance (Aaby, 1975). Such an interpretation should recognize both the probabilistic nature of any conclusion drawn and the matter of "effect size."

Looking first at the probabilistic nature of any conclusion drawn from a statistical test, the Commission and the Courts seem to have taken the results of particular tests as "fact." In Pfizer (1972), an expert witness attested to the statistical significance of the results of an efficacy test of a sunburn remedy and the administrative law judge accepted the claim that the product "worked." The Commission skirted the issue of interpretation, but before rejecting the finding as evidence (because the performance of the test had not preceded the making of the claim) noted:

The nature and intricacy of the debate on the adequacy of this test leads to the view that the Commission's role should simply be one of attempting to determine the existence and general quality of the tests and a threshold determination as to the reasonableness of reliance thereon, rather than an attempt to conclusively determine the adequacy of the tests (Pfizer, 1972, 67).

In two earlier cases the Commission and, eventually, the Courts were faced with results which were "in the right direction" but failed to achieve statistical significance. In FTC vs. Country Tweeds (1964), the failure of the respondent firm to mention that test results reproduced for promotional purposes had been "insignificant" was held to be misleading by the FTC and specifically upheld by the court. In FTC vs. Sterling Drug (1963), the issue was also what the advertiser had said about a non-significant finding. At issue were the data reproduced in Table 2.

TABLE 2

PAIN RELIEF SCORES FROM THE TEST AT ISSUE IN FTC VS. STERLING DRUG (DEKORNFELD, LASAGNA AND FRAZIER, 1962)

[Scores not reproduced in this copy.]

The ad in question had conceded that there were no significant differences among the products but went on to say "Nevertheless it is interesting to note that within just 15 minutes Bayer Aspirin had a somewhat higher pain relief score than any of the other products." Although the FTC objected to this statement, the Court found the statement "literally true" and went on to observe:

The fact that the margin of accuracy of the scoring system was .124 -- meaning that the second place drug might fare as well or better over the long run of statistical tests -- does not detract from the fact that on this particular test, Bayer apparently fared better than any other product in relieving pain within fifteen minutes after its administration (FTC vs. Sterling Drug, 1963, 677).

In short, the Commission accepted the lack of significance as establishing "no difference" as fact while the Court seemed to recognize the probabilistic nature of the conclusion but focused on the specific instance reported in the data. However, the data in Table 2 raise two further questions, both related to effect size and the power of the statistical test employed.

The ability of a test to reject a null hypothesis when it is, in fact, false (i.e., the test's power) is a function of sample size, effect size (the real difference) and the level of alpha (Cohen, 1969). Assuming that there is some effect (Bayer really is modestly more effective in fifteen minutes) and the level of alpha fixed at .05 (by convention or decree) there is, presumably, some sample size sufficiently large to produce a "statistically significant result." If the Commission's case rests upon the failure to reject a null hypothesis at some given level of alpha, while the respondent's case rests on rejection, the question of sample size becomes a point of contention between the two. Unless some attention is given to the question of how much "A" must be better than "B," in an adversary proceeding the Commission would maximize the number of its enforcement orders by insisting on small samples while the respondents would always prefer to defend with large samples. When the advertiser has made a specific claim ("'A' is 25% better than 'B'") the alternative hypothesis is clear. However, in the presence of an indefinite alternative ("'A' is better than 'B'"), the search for "truth" may devolve into a battle over semantics or among statisticians (Advertising Age, 1974). The FTC and the Courts have recognized that differences may not be practically significant (Lorillard, 1950) but do not appear to have related this to the issue of a test's power.

Also involved is the relationship between the alpha level and the power of a test. Other factors (effect size and sample size) equal, a test's power diminishes as the size of alpha decreases. Put another way, as the probability of making a Type I error is decreased, the probability of making a Type II error increases. If there are costs associated with each error type, the problem goes beyond the statistical issue to the decision making framework of the Commission.
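The relationships described in the last two paragraphs can be illustrated numerically. The sketch below is hypothetical (the effect size, sample sizes, and the use of a one-sided binomial test are all invented for illustration) and uses the normal approximation to the binomial for convenience:

```python
from math import sqrt
from statistics import NormalDist

def power(p_true, n, p0=0.5, alpha=0.05):
    """Approximate power of a one-sided test of H0: p = p0 against
    H1: p > p0, via the normal approximation to the binomial."""
    z_a = NormalDist().inv_cdf(1 - alpha)                  # critical z for the chosen alpha
    crit = p0 + z_a * sqrt(p0 * (1 - p0) / n)              # smallest "significant" proportion
    z = (crit - p_true) / sqrt(p_true * (1 - p_true) / n)
    return 1 - NormalDist().cdf(z)                         # P(reject H0 | p = p_true)

# A modest real effect (p = .55): power rises with sample size, so a large
# enough sample makes even a small true difference "significant" ...
for n in (50, 200, 1000):
    print(n, round(power(0.55, n), 2))

# ... while tightening alpha from .05 to .01 lowers power, i.e. raises beta.
print(power(0.55, 200, alpha=0.01) < power(0.55, 200, alpha=0.05))
```

This is exactly why, absent agreement on how much "A" must exceed "B," the choice of sample size itself becomes a point of contention between the Commission and a respondent.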


In the final analysis, the consumer's interest is not in the confirmation or disconfirmation of a statistical hypothesis but in the "truth" of a given claim. Ultimately, the statistical "result" is a probabilistic statement and, as such, is evidence but not proof. However, by relying on sample evidence from a properly conducted study, the risks of erroneously allowing a false claim to stand (Type I error) and erroneously removing a true claim (Type II error) are estimable. It also seems plausible that the disutility to consumers associated with either of these errors will vary from situation to situation.

Kaplan (1968) has provided a framework for evaluating the degree of certainty which evidence must possess to make a decision, taking the costs of "wrong decisions" into account. Expressed as an inequality:

P > 1 / (1 + Dg/Di)

where:

P = the degree of certainty needed to convict or find for the plaintiff.

Dg = the disutility of allowing the guilty to go free (i.e., letting a false ad claim stand).

Di = the disutility of convicting the innocent (i.e., removing a true ad claim from the market).

Kaplan observes that in most civil actions ("X" sues "Y") the disutilities of erroneously finding for the plaintiff or the defendant are most likely equal (Dg/Di = 1.0). If so, P becomes .5 and any "preponderance of evidence" for the plaintiff (i.e., there is more reason to believe the plaintiff is right) should be sufficient to find for him. However, in criminal cases, because we hold to the general belief that it is better for the guilty to go free than for the innocent to be convicted (Dg/Di < 1.0), the weight of evidence needed to convict (expressed as the probability of belief, P) should be much closer to 1.0. This is especially true when the penalty is severe (e.g., capital punishment).
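Kaplan's threshold reduces to a one-line computation. Finding for the plaintiff is warranted when the expected disutility of letting the guilty go free, P * Dg, exceeds that of convicting the innocent, (1 - P) * Di; solving for P gives a cutoff that depends only on the ratio Dg/Di. A minimal sketch:

```python
def required_certainty(dg, di):
    """Kaplan's threshold: find for the plaintiff only when the degree
    of belief P satisfies P * dg > (1 - P) * di, i.e.
    P > di / (dg + di) = 1 / (1 + dg / di)."""
    return di / (dg + di)

print(required_certainty(1, 1))   # civil action, Dg/Di = 1.0 -> 0.5 ("preponderance")
print(required_certainty(1, 9))   # Dg/Di < 1.0 (criminal-style) -> 0.9, closer to 1.0
```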

In Pfizer (1972), the FTC appeared to believe that the disutility of relying on a false claim (Dg) fell exclusively on the consumer while the disutility associated with Type II error (Di) fell only on the producer and that, therefore, Dg was always larger than Di. While not completely rejecting that notion (at least with respect to Di), the recently announced dismissal of the order against "Dry-Ban" seems to indicate that the Commission recognizes that Dg will vary from case to case (Bristol-Myers, 1975). Specifically, that decision implies that greater Dg is experienced when the audience is particularly vulnerable, when health and safety claims are involved and when serious economic loss results from reliance on a false claim. Further, the decision recognizes that injury to competition is a factor involved in estimating the magnitude of Dg.

The losses associated with erroneously removing a true claim from the market (Di) are less easily recognized but no less real. If product "A" is truly better than "B," but the producer of "A" is forced to stop claiming its superiority, the losses to the firm are obvious -- lost sales and the costs of presenting a defense. However, at least two other disutilities would result. First, consumers would be denied access to potentially useful information (assuming the claim is not only valid but relevant). It should be noted that in a recent case (Crown Central Petroleum, 1974) the Commission explicitly chose not to consider whether a 10% reduction in auto emissions had "social value" and decided the case on the narrower grounds that the gasoline in question didn't produce "clean air" as claimed. A second disutility is involved if the Commission uses some of its limited "regulatory resources" to achieve a faulty goal, resulting in an opportunity loss associated with the consequent failure to pursue some other deception (Mittelstaedt, 1972). But, however the magnitude of Di is estimated and whatever it includes, the imposition of the "ultimate penalty," corrective advertising, always increases Di relative to Dg.

The point is that there are disutilities involved in both types of error. The public is less well off when a false ad is allowed to stand or when a true ad is removed. While these disutilities are difficult to quantify, only the ratio of Dg/Di is really important and subjective estimates will probably have to suffice.


In the decision process, both the disutilities and probabilities of each type of error are important and this brings us back to the question of rejecting null hypotheses. Two possible situations seem worthy of comment.

First, assume that a firm has been claiming that its product "A" is better than "B." As substantiation, it has a research study which it and the FTC agree is a "fair test" and which rejects the null hypothesis ("A" is no better than "B") at the .05 level. On the basis of this test alone the Commission cannot know if the claim is really true but, if they accept this finding as fact, they are taking a known chance (.05) of allowing a false claim to stand. Kaplan's model suggests that the Commission should rely on the results of the test unless Dg exceeds Di by a very wide margin (19 to 1). In cases where the costs of allowing the false claim to stand are relatively large (e.g., safety claims), a smaller level of alpha (e.g., .01) may be appropriate.

Second, assume the test's results fail to reject the null hypothesis that "A" is no better than "B." Here the proper action is less clear for the Commission's "case" rests on the acceptance of the null hypothesis as "fact." The researcher faced with non-significant results is usually told by the statistician to follow the Fisherian advice and "suspend judgment." This advice is sound when the alternative hypothesis is indefinite and the costs of error of either type are unknown or small (Hays, 1973, 355-357). While researchers may be able to suspend judgment, the Commission probably can't (or won't). But, if the power of the test used in the research can be calculated, which rests on the ability to specify some meaningful alternative hypothesis, the risks of Type II error can be compared with the relative disutilities of Dg and Di. As the ratio (Dg/Di) grows smaller, the question of the test's power becomes increasingly important. Since "corrective advertising" increases Di relative to Dg, the power of the test should be a critical determination before this remedy is applied.
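The reasoning in the two situations just described can be made concrete as an expected-disutility comparison. This sketch is a deliberate simplification (it treats alpha and beta loosely as the probabilities that the respective findings are wrong, which is not strictly correct), and it describes no rule the Commission has adopted:

```python
def decide(significant, alpha, beta, dg, di):
    """Choose the action with the smaller expected disutility.
    alpha ~ chance a significant result nonetheless supports a false
    claim; beta ~ chance a non-significant result condemns a true one.
    Both readings are loose simplifications for illustration only."""
    if significant:
        # allowing the claim risks alpha * dg; removing it risks (1 - alpha) * di
        return "allow" if alpha * dg < (1 - alpha) * di else "remove"
    # allowing risks (1 - beta) * dg; removing risks beta * di
    return "allow" if (1 - beta) * dg < beta * di else "remove"

print(decide(True, 0.05, 0.20, 1, 1))    # equal disutilities -> allow
print(decide(True, 0.05, 0.20, 30, 1))   # Dg/Di > 19 -> remove, despite significance
```

Note that with alpha = .05, a significant result is overridden only when Dg/Di exceeds 19 to 1, consistent with the margin discussed above; and when a result is non-significant, the test's power (1 - beta) drives the outcome.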

Increasingly, studies of the sort familiar to consumer researchers will be used to substantiate advertising claims. As researchers (and consumers) we have a stake in seeing that something positive results from the effort. Questions of measurement validity and the generalizability of samples will, and should, continue to engage our attention but, in the end, research results must be interpreted as evidence for or against specific claims. Full utilization of any study's information in the regulatory process requires an understanding of the probabilistic nature of research results. To focus on Type I error alone will not ensure correct interpretation. What is needed is an understanding of both Type I and Type II errors and an appreciation of the relative costs of each.

REFERENCES

Nils-Erik Aaby, "American Comparative Advertising: The Problem of Substantiation," paper presented at Annual Meeting of Southwestern Marketing Association, Houston (March 8, 1975).

Advertising Age, 16 December 1974, 14.

Advertising Age, 12 May 1975, 32.

Alberto-Culver Co., et al., 3 Trade Reg. Rep. ¶ 20,357 (1973).

Bristol-Myers Co., et al., 3 Trade Reg. Rep. ¶ 20,900 (1975).

Jacob Cohen, Statistical Power Analysis for the Behavioral Sciences (New York: Academic Press, 1969).

Crown Central Petroleum Corp., 3 Trade Reg. Rep. ¶ 20,790 (1974).

Thomas J. Dekornfeld, Louis Lasagna and Todd M. Frazier, "A Comparative Study of Five Proprietary Analgesic Compounds," Journal of the American Medical Association, 182 (December 29, 1962), 1315-1318.

Developments in the Law, "Deceptive Advertising," Harvard Law Review, 80 (March, 1967), 925-1164.

Firestone Tire and Rubber Co., 81 FTC 463 (1972).

FTC News Summary (no. 2), 16-31 January, 1974(a), 1.

FTC News Summary (no. 10), 16-31 May 1974(b), 1.

FTC News Summary (no. 23), 30 May 1975, 1.

FTC vs. Country Tweeds, 326 F 2d. 144 (1964).

FTC vs. P. Lorillard, 186 F 2d. 52 (1950).

FTC vs. Sterling Drug, et al., 317 F 2d. 669 (1963).

David M. Gardner, "Deception in Advertising: A Conceptual Approach," Journal of Marketing, 39(January, 1975), 40-46.

Ernest Gellerhorn, "Proof of Consumer Deception Before Federal Trade Commission," University of Kansas Law Review, 17(1969), 559-585.

Ernest Gellerhorn, "Rules of Evidence and Official Notice in Formal Administrative Hearings," Duke Law Journal, 1(1971), 1-50.

Gillette Co., et al., 3 Trade Reg. Rep. ¶ 20,340 (1973).

William L. Hays, Statistics for the Social Sciences, (New York: Holt, Rinehart and Winston, 1973).

John Kaplan, "Decision Theory and the Factfinding Process," Stanford Law Review, 20 (June, 1968), 1065-1092.

Robert A. Mittelstaedt, "Consumer Protection and the Value of Information," in M. Venkatesan (ed.), Proceedings of the Third Annual Conference of the Association for Consumer Research, (1972), 101-106.

National Dynamics Corporation, et al., 82 FTC 488 (1973).

Pfizer, Inc., 81 FTC 23 (1972).

Richard A. Posner, Regulation of Advertising by the FTC, (Washington, D. C.: American Enterprise Institute for Public Policy Research, 1973).

George Eric Rosden, The Law of Advertising (vol. 2), (New York: Matthew Bender, 1974).