The Impact of Program Evaluation Needs on Research Methodology

R. Bruce Hutton, University of Denver
Dennis L. McNeill, University of Denver
ABSTRACT - This paper presents a perspective on impact evaluation research that stresses the diagnostic role of evaluation in addition to the more common end-results evaluation. This leads to several implications for research--most notably that the research should require multiple methods to accurately diagnose program effects. An example of a recent program and the attendant methodological decisions illustrate the implications of applying this perspective to evaluation research.
R. Bruce Hutton and Dennis L. McNeill (1981), "The Impact of Program Evaluation Needs on Research Methodology," in NA - Advances in Consumer Research Volume 08, ed. Kent B. Monroe, Ann Arbor, MI: Association for Consumer Research, Pages: 547-552.


INTRODUCTION

Recently, an increased emphasis on program evaluation has emerged in the public sector. While several different forces have spurred this emphasis, the motivation comes principally from growing dissatisfaction with seemingly ineffective programs (e.g., Howard and Antilla 1979) and the desire for a more systematic approach to policy decisions (e.g., Mazis and McNeill 1978).

However, the use of evaluation research by policy makers remains naive and is characterized by a justifiable hesitation as to the ability of research to evaluate complex social programs. Some of this hesitation comes from a lack of experience with research design, but a good portion of this reserve is justified by the embryonic practice of program evaluation. There are inherent problems in the research field which revolve around the selection of a proper research design, development of valid dependent measures, use of appropriate analysis techniques, and drawing valid conclusions from the data. These problems are compounded by a lack of systematic feedback to the relevant disciplines regarding evaluation research that has been completed. However, as evaluation research becomes a larger part of the academic literature, it is clear that key methodological decisions have rendered many past evaluation efforts meaningless (see, for example, Phillips and Calder 1980).

This paper describes the design of a recent evaluation project (Hutton and McNeill 1980) which incorporated a multi-method approach to answering questions about program impact. This approach has been suggested in the past (see Campbell and Fiske 1959, and Heeler and Ray 1972) but has received virtually no support through implementation in evaluation research. Reasons for this oversight range from cost constraints to a myopic view of the evaluation problem.

THE EVALUATION PROBLEM

The literature on evaluation is replete with various terms describing evaluation typologies. Scriven (1967) describes evaluation research as formative (pretesting or developmental research) and summative (measuring program effects) evaluation. Additionally, Freeman (1976) identifies a form of evaluation called process evaluation which investigates the procedures that were used to implement a policy program. Essentially, the research questions which can be answered by evaluation are of three generic types:

- End-Result Evaluation

Did the program accomplish the goal for which it was originally designed?

- Diagnostic Evaluation

Why did the program results occur?

- Formative Evaluation

What components of the program should be included?

It should be noted that evaluation research often occurs after the program goals and components have been determined. Consequently, the role of formative evaluation is likely to be best aided by careful diagnosis of past efforts. As noted by Van Maanen (1979), much can be gained from findings with comparative groundings--those conclusions developed by contrasting programs with similar goals but different means of achieving them. Generically, formative evaluation is different from the attempt to chronicle program effects through end-results or diagnostic research efforts.

End-result evaluation is the most common approach found in the design of evaluation studies and, of the two tasks, is the easier to operationalize. End-result analysis requires a specification of the level of ultimate effect of the program (often behavior), and the design measures at that level once the program has had time to work. Clearly, the goals of the program and the timing of the results drive the methodological decisions. Descriptive survey methods dominate the approach to this first task, although there is some use of quasi-experimental designs. However, these methods often suffer from an inherent inability to explain why results occurred (e.g., surveys) or from threats to internal validity in certain commonly used quasi-experimental designs. In this case the policy decision must then rely on post-hoc explanations of results which can never be fully defended.

Diagnostic evaluation is a different and more complex task. The goal of this research is to profile program impact in a way that an understanding of the reasons for that impact can lead to adjustments in the current program and better program decisions in the future. The increased complexity of this type of research question requires the researcher to consider many more facets of program effectiveness, including the following:

- goals of the program;

- timing of program effects;

- timing of, and effects of, individual program components;

- external factors which could influence program success or failure (e.g., conflicting programs already in place);

- individual difference variables in the population;

- the potential uses of the program results; and

- the time frame for future policy decisions.

The diagnostic evaluation process makes a more concentrated attempt to profile the environment in which the program must work and takes a more systematic approach to the evaluation than an end-result perspective. In addition, a diagnostic evaluation recognizes the limitation of a single methodology to answer all research questions.

The following sections will describe a case study of research design decisions which were intended to address both the diagnostic and end-result evaluation needs of a public policy program. In addition, results of the evaluation will be highlighted.

THE PROGRAM TO BE EVALUATED

The program evaluated was called the Low Cost/No Cost Energy Conservation Program (LC/NC) and was designed and implemented by the U.S. Department of Energy. The program was introduced in the six New England states in the fall of 1979 and consisted of the following components:

- A LC/NC booklet which described eleven categories of home energy conservation tips. These tips could be completed by the household for low or no cost and the eleven tips could save 25% of the home fuel costs in the winter.

- Paid advertising in radio, television, and newspapers. This activity consisted of professionally designed ads run in the New England region from November 8 to December 2, 1979.

- A showerflow control device given as an incentive with each LC/NC booklet. The device was a small plastic cone-shaped object which, when installed in a showerhead, restricts the flow of water (and the use of hot water) from the shower. This was also one of the eleven LC/NC tips.

- Some public relations efforts to initially introduce the program to the region.

The focus of the program was the LC/NC booklet, which was hand delivered with the showerflow device by the U.S. Postal Service to 4.5 million households in New England. All of the other program components were designed to inform the population about the program and to motivate the population--through the expected savings--to complete the LC/NC tips.

Several research questions were relevant to this program but two were prominent for the research methodology:

- Did the program stimulate LC/NC energy conservation behavior?

- Did the advertising contribute to the program success?

The first question is the end-result evaluation, which essentially asks for a tally of energy conservation behaviors that resulted from the program. However, the timing of these behaviors is critical. The end of the year presents many competing demands on the consumer pocketbook and time, but due to the nature of future policy decisions, results were necessary by January 1980.

The second major question which impacted on research revolved around the role of advertising in program success. Since paid advertising was a relatively new and controversial component of public programs, it was felt that the role of advertising in improving program effects should be documented. The difficulty now lay with matching methodology to evaluation needs.

MATCHING EVALUATION NEEDS WITH METHODOLOGY

The End-Results Evaluation

Table 1 presents the basic research questions and the multi-method designs which were used to gain diagnostic answers from program results.

The end-results evaluation posed some difficult problems in design. First, the after-only design (a survey) alone would not be able to provide a systematic basis for judging what the conservation behaviors would have been without the LC/NC program (Campbell and Stanley 1963).

TABLE 1

DESCRIPTION OF RESEARCH DESIGNS

Thus, either high success or failure would have been easily assailed without a basis for defense. Consequently, it was decided that a comparison group design be employed.

There are many critical decisions to make in the selection of a matched control condition for this evaluation. In order to make a proper selection and to avoid the negative artifacts of this process, the goal of the evaluation must be clarified. This evaluation must show whether the LC/NC program stimulated energy conservation behavior. Thus, the matching criteria must be those that would increase the probability of response to energy conservation (e.g., severe winters, dependency on expensive fuel, etc.). In other words, the control must be at least equally ready to respond to this program due to the need for these specific conservation behaviors. It was decided that New York, particularly the upstate area which included Rochester and Buffalo and rural Genesee County, matched New England well on several a priori characteristics:

- heating oil dependence;

- severity of winter;

- age and structure of homes;

- availability of previous conservation efforts; and

- demographic profile of the population.

However, because New York matched New England on readiness to respond to conservation, the region also carried with it a considerable amount of conservation activity. These conservation activities were more prominent than in New England and, as the results were to show, caused New York to evidence more conservation behaviors prior to the program than the level found in New England. In a strict sense the groups were not matched on actual prior conservation behavior but were matched on the more critical dimensions which characterized the readiness of the region to respond to the LC/NC program. The unequal level of conservation activity cannot be ignored, for it introduces some alternative hypotheses to the results that were measured after the program was in place. These alternative hypotheses affected both the analysis and the questionnaire design.

Three alternative hypotheses were most critical to this study: 1) a different groups hypothesis, 2) a ceiling effect hypothesis, and 3) an alternative causal element hypothesis.

The different groups hypothesis states that the people who are currently eligible for LC/NC behaviors in New York and New England are not alike. Since the current proportion of the population completing LC/NC behaviors in New York is higher, the easy-to-persuade segments of the population may have already been reached. Comparing the responses since the program's inception is probably an unfair test due to the increased difficulty of persuading later adopters in New York (Shoemaker 1979).

The ceiling effect hypothesis operates in the same direction but for a different reason. A ceiling effect would suggest there exists some asymptotic level of behavioral response to energy conservation beyond which only minor changes in behavior are possible. Thus, this potential alternative hypothesis also makes inappropriate a comparison of the responses since the LC/NC program if the initial levels of response are not comparable. The changes in New York would be smaller by definition because these responses are closer to the ceiling.

The alternative causal element hypothesis states that whatever caused New York to be high is now causing the behavioral responses in New England to change.

All of the above have implications that lead the analyst to consider only the overall behavioral penetration levels in the test between the treatment (New England) and control (New York) conditions. If overall behavioral change is greater in New England than in New York for the LC/NC tips, the alternative hypotheses become unlikely explanations for the results. The different groups hypothesis is not operable, since the LC/NC program would have to better penetrate the hard-to-persuade segments in New England. The ceiling effect hypothesis would be a poor alternative, since the responses would have to exceed the hypothetical ceiling for New England to produce significant behavioral change. Finally, the alternative causal element would not be a likely explanation, since the LC/NC program would have to produce changes greater than any hypothesized other causal element.

In the case where comparison groups cannot be equated on behavior prior to the program, the analysis must be constrained to a test of overall behavior rather than behavior since the program. This does not overcome the matching problem, but it does eliminate the alternative hypotheses endemic to a pre-post comparison and serves the additional benefit of providing a rather stringent test of program impact.
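The constrained analysis just described, comparing overall penetration levels between treatment and control rather than change scores, amounts to a simple two-proportion test. The sketch below illustrates the form of that test; the counts are hypothetical, not the study's data.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference in overall penetration
    rates between a treatment and a control region."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical overall penetration of one LC/NC tip:
# 700 of 1207 New England households vs. 320 of 604 in New York.
z, p = two_proportion_ztest(700, 1207, 320, 604)
```

Because the test is on overall penetration, New England must overcome New York's head start in prior conservation behavior for z to come out positive, which is what makes this a stringent test of program impact.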

The lack of matching on prior levels of behavior has implications for the questionnaire development. The questionnaire must be developed to measure behavioral response to the LC/NC energy saving tips. However, because of the focus on behavior, the short duration between program introduction and measurement, and the lack of a perfect match on prior levels of LC/NC behaviors, the questionnaire content, as shown in Figure 1, must also address the following:

- the ability of respondents to do the LC/NC tips,

- the reading of the LC/NC booklet,

- the timing of the completion of the tip,

- intentions for completion of the tip, and

- reasons for not intending to complete the LC/NC tips.

As in most studies, particular attention must be given to the validity of dependent measures. In this case the concern is with the reliance upon self-reported behaviors and the potential cue for "right" answers given the need to reference the LC/NC booklet. The validity of these responses was examined in two ways. The first is face validity following from the logical flow of the questions. For example, a yea-saying bias may push respondents to say they have completed the LC/NC tip recently. If a subject said no, that the LC/NC tip had not been completed since the program, they were then asked if they saw the tip and intend to complete it later. A no answer to these two questions was followed by "why don't you intend to complete the tip?" It is here that respondents can indicate it has already been completed. In order to assess the validity of the timing of the completion of the tip, a percentage of the respondents should have followed the above sequence to indicate that the tip had been done before the program. On the face of the logic of these response patterns, the researcher has some increased confidence in the validity of the timing of the behavior.

However, this does not address the presence of a bias to report completion of the tip when this is not the case. This was handled by a second phone call three weeks following the initial interviews to a sample of households to test the responses to key LC/NC tips. In a sample of 180 households in the treatment and control condition, 96% of the original responses were confirmed, and the remaining 4% were explained by an inability to query a knowledgeable member of the household regarding the LC/NC tips.
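The re-interview validation above reduces to an agreement-rate computation between the original and follow-up self-reports. A minimal sketch, using hypothetical responses for a single tip (1 = tip reported completed, 0 = not completed):

```python
def agreement_rate(original, followup):
    """Fraction of re-contacted households whose follow-up answer
    matches their original self-report for a given tip."""
    matches = sum(o == f for o, f in zip(original, followup))
    return matches / len(original)

# Hypothetical responses from 10 re-contacted households.
orig_answers = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
recheck      = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]

rate = agreement_rate(orig_answers, recheck)
```

In the actual study this computation over 180 households yielded 96% confirmation, with the remainder attributed to the interviewer's inability to reach a knowledgeable household member.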

The final methodological consideration involves sampling. In both the treatment and control conditions, probability samples of 1207 and 604 households in New England and New York, respectively, were prepared from telephone exchanges in two strata--big cities and non-metropolitan areas. The method included unlisted numbers and new listings in their correct proportion.

FIGURE 1

THE COMPONENTS OF THE LC/NC QUESTIONNAIRE

The Diagnostic Evaluation

The diagnostic evaluation focused on the contribution of paid television advertising and resulted in two separate studies being completed. The first answered the question of the relationship of exposure to LC/NC television ads on the impact of the LC/NC program. The second study addressed the issue of the potential of the LC/NC ads to aid program impact.

The Relationship of the LC/NC Ads to Program Impact.  This study had considerable importance for this program and for others desiring to use paid advertising as part of an integrated policy program. There were three specific goals for this aspect of the evaluation. The first was to describe the relationship between exposure to the ads and program results. The secondary goals of this study were to examine the efficacy of the media-buying decisions and to serve as cross validation for the end-results study. The key issue for the first goal was to implement a methodology whereby exposure to the LC/NC ads could be unobtrusively measured and related to overall program impact. In order to accomplish this, a sample of households in the greater Boston metropolitan area was selected from an existing mail panel to keep a diary of adult TV watching behavior for the period November 8 to December 2, 1979. This period corresponds to the period when paid television advertising was in operation. A sample of the diary format is found in Figure 2. An original sample of 250 households was selected and, by December 15, 154 (61.6%) households had mailed in the set of four completed diaries.

The task then was to code the diaries so that the television watching could be matched to the times when the LC/NC ads were shown. The relationship to the LC/NC program effects was accomplished by a telephone interview using the questionnaire from the end-results evaluation. This questionnaire was administered to 92 households from December 15 to December 24. The exposure to the LC/NC ads was then related to the impact of the LC/NC program effects.

FIGURE 2

(SAMPLE FORMAT) - TV VIEWING QUESTIONNAIRE FOR SATURDAY

Several important considerations need to be addressed with this design. First, the amount of TV watching is likely to be correlated with LC/NC ad exposure. However, the key goal of this design is the assessment of the effect of increased "real world" exposure to the ads on LC/NC conservation tips. Thus, while the design cannot causally link increased repetition with program effects, it can: 1) describe the pattern of behaviors and ad exposure found in the sample population; and 2) through careful analysis, constrain the ability of alternative hypotheses to explain results. In addition, the correlation between ad exposures and LC/NC behaviors is most likely to be explained by overall television watching behavior in cases where ad exposure and LC/NC behavior are linearly related.
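One way to probe whether overall TV viewing, rather than ad exposure itself, accounts for an exposure-behavior correlation is a partial correlation computed from the diary data. The following is an illustrative sketch on simulated data; the variable names, effect sizes, and sample size are assumptions for illustration, not the study's figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated diary data for 92 households: heavier viewers
# accumulate more LC/NC ad exposures by construction.
hours_tv = rng.normal(20.0, 5.0, size=92)                  # weekly viewing hours
ad_exposures = 0.3 * hours_tv + rng.normal(0, 1, size=92)  # coded from diaries
tips_done = 0.5 * ad_exposures + rng.normal(0, 2, size=92) # LC/NC tips completed

def partial_corr(x, y, control):
    """Correlate x and y after regressing out the control variable
    from each (first-order partial correlation via residuals)."""
    def residuals(v, c):
        beta = np.polyfit(c, v, 1)       # simple linear fit on control
        return v - np.polyval(beta, c)   # what the control can't explain
    rx, ry = residuals(x, control), residuals(y, control)
    return np.corrcoef(rx, ry)[0, 1]

r_raw = np.corrcoef(ad_exposures, tips_done)[0, 1]
r_partial = partial_corr(ad_exposures, tips_done, hours_tv)
```

If the raw correlation shrinks toward zero once total viewing hours are partialed out, overall TV watching is the more plausible explanation; if it survives, the case for an exposure effect is strengthened.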

As mentioned earlier, there are two secondary goals for this study. The exposure to the ads can confirm the media-buying decision used in this program. The advertising was purchased on the basis of cumulative gross rating points that indicated 81.5% of the population would be exposed to the LC/NC ads at least 12 times. The pattern of exposure of this panel to the ads can confirm (or disconfirm) the efficacy of this media-buying decision.
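Checking the media buy against the diary records reduces to a reach-at-frequency computation: the share of panel households exposed at least 12 times, to be compared against the 81.5% figure projected from gross rating points. A minimal sketch with hypothetical per-household exposure counts:

```python
def reach_at_frequency(counts, k):
    """Share of households exposed to the ads at least k times."""
    hit = sum(1 for c in counts if c >= k)
    return hit / len(counts)

# Hypothetical LC/NC ad exposure counts coded from the four weekly
# TV diaries (one integer per panel household).
exposures = [0, 3, 14, 7, 12, 1, 22, 5, 9, 13, 2, 0, 6, 15, 4, 11]

share_12_plus = reach_at_frequency(exposures, 12)
```

The actual panel results (9% of males and 22% of females reaching the planned frequency, as reported below) illustrate how far a diary-based estimate can diverge from a gross-rating-point projection.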

Further, the results of the survey can serve as a comparison to the end-results evaluation methodology in terms of the penetration of the LC/NC behaviors.

The Memorability of the LC/NC Ads.  In order to determine the role of paid advertising in the effects of the LC/NC program, it is necessary to rule out an alternative hypothesis that the ads cannot contribute due to poor communication effectiveness. This essentially asks "Do the LC/NC ads have the communication potential to contribute?"

This task was accomplished by testing the recall of the LC/NC ads with a random sample of commercial viewers the day after the first exposure to the ads. These tests were run on November 9 in Boston with a sample for each of the two versions of the LC/NC ad. The sample (n = 305) was screened to make sure respondents were in the commercial audience; they were then asked to recall the commercial and its messages. The commercial recall was then compared to industry norms for 30-second product commercials, and, in addition, the messages recalled were compared to the intent of the LC/NC program.

Thus, the diagnostic goals approach has resulted in a multi-method approach to the evaluation of the LC/NC program. This multi-method design has the advantage of removing much of the need for post-hoc guesses as to the reasons for program effects and, as the next section indicates, the results provide a more complete profile of program impact.

RELATING RESEARCH OUTPUT TO THE RESEARCH QUESTIONS

As stated earlier two research questions guided the research design. Presented below are several highlights of the results of the diagnostic evaluation of the LC/NC program.

What were the program effects?

- Readership of the LC/NC booklet was high (71.1% of the sample in New England).

- Four out of eleven LC/NC energy conservation tips evidenced significant (p<.05) improvement when comparing New England to New York.

- New England had completed more LC/NC energy saving tips than New York (New England 4.63 vs. New York 3.96; t = 5.0, p <.001).

- The biggest predictor of the degree of response to the program was the amount of prior behavior. Those with the greatest response to the program had completed significantly fewer LC/NC tips prior to the program.
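The New England/New York comparison of mean tips completed is a standard two-sample t-test. The sketch below uses the reported group means (4.63 vs. 3.96) and sample sizes (1207 vs. 604), but the standard deviation and the simulated draws are assumptions, so it only illustrates the form of the test, not the study's computation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated counts of LC/NC tips completed per household, centered
# on the reported group means; the common SD of 2.5 is assumed.
new_england = rng.normal(4.63, 2.5, size=1207)
new_york = rng.normal(3.96, 2.5, size=604)

# Welch's t-test (does not assume equal group variances)
t_stat, p_val = stats.ttest_ind(new_england, new_york, equal_var=False)
```

With the actual survey data this comparison yielded t = 5.0, p < .001, as reported above.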

The results of the second research question yielded similar diagnostic findings:

What was the relationship of paid television advertising to program effects?

- The recall study indicated the LC/NC ads communicated as well as the normal 30 second commercial.

- The 25% savings claim was recalled as a poor third selling point. This took on added meaning when it was found that those believing this claim did more LC/NC tips.

- Advertising exposure was related to enhanced program effects when exposure increased from very low levels (0 to 1 exposure per week) to 1 to 2 exposures per week. However, further increases in exposure were associated with a significant decrease in LC/NC behaviors. This raises the specter of an early wearout for this type of ad and questions the use of a decision rule founded in a "more is better" philosophy for the use of advertising in conservation programs.

- Only 9% of the males and 22% of the females reached the level of exposure that the cumulative gross rating point media-buy decision rule had indicated.

The above are highlights of the results of the LC/NC impact evaluation, which attempted to empirically diagnose as well as evaluate the effects of the program. Several implications emanate from this example of a diagnostic perspective on evaluation.

IMPLICATIONS

Evaluation research is not a discipline but rather a conglomeration of applied scientists attempting to provide empirical data to policy makers regarding the effects of a public program. This lack of organization results in a very complex and important area receiving little attention and almost no systematic feedback to the impacted disciplines.

In addition, for several reasons the researcher cannot go to the policy maker for guidance. The policy maker does not have adequate training to judge research methodology and policy makers are, at the very least, hesitantly supportive of evaluation. The implication of the two previous points places a great burden on the internal standards of the researcher.

A number of factors make the decisions in research methodology difficult ones. Several of these factors are attendant to the policy programs themselves. The time available for planning the research is often short, since evaluation is generally one of the later program decisions. This often precludes taking premeasures and places a constraint on the time for planning the research design. In addition, decisions for future policy often have to be made before the research is completed, as the research often fails to play its proper role in program planning. Also, the nature of policy programs often precludes the development of a proper control condition or makes difficult the identification of properly matched groups. Finally, the orientation of evaluators is heavily toward survey-type designs and single-method evaluation studies. Both of the latter are severely limiting decisions.

There are clear advantages to attempting causal designs: establishing program effectiveness, ruling out alternative hypotheses, and removing the need for post-hoc theorizing in explaining results. Of equal significance is the need for multiple methodologies to fully profile program effects. The most common perspective on evaluation is to provide end-results data. However, this ignores the fact that programs are part of an on-going activity which needs to understand program effects as well as document them. This highlights the inherent deficiency of any single study to measure impacts as well as diagnose the results for understanding. But this often means that the individual studies undertaken may, when taken alone, evidence deficiencies. These deficiencies, however, must be viewed in light of the complete profile of the evaluation effort.

This paper has presented the methodological decisions of a recent evaluation effort for the LC/NC program. In this evaluation multiple-methods were used to both strengthen the conclusions of program effects through convergence of the results of the various studies, and to provide diagnosis so that future programs could benefit from the analysis of the LC/NC tactical decisions.

REFERENCES

Campbell, Donald T. and Fiske, Donald W. (1959), "Convergent and Discriminant Validation by the Multi-Trait Multi-Method Matrix," Psychological Bulletin, 56, p. 81-105.

Campbell, Donald T. and Stanley, Julian C. (1963). Experimental and Quasi-Experimental Designs for Research, Chicago, Rand McNally.

Freeman, Howard E. (1976), "The Present Status of Evaluation Research," Evaluation Studies Review Annual, p. 17-51.

Heeler, Roger M. and Ray, Michael L. (1972), "Measure Validation in Marketing," Journal of Marketing Research, 9, p. 361-370.

Howard, Niles and Antilla, Susan (1979), editorial, Duns Review, September, p. 49-57.

Hutton, R. Bruce and McNeill, Dennis L. (1980), "An Empirical Evaluation of the Low Cost/No Cost Energy Conservation Program," Draft, U.S. Department of Energy, Washington, D.C.

Mazis, Michael G. and McNeill, Dennis L. (1978), "The Use of Marketing Research in FTC Decision-Making," in S. C. Jain (ed.) Research Frontiers in Marketing: Dialogues and Directions, American Marketing Association, Chicago.

Phillips, Lynn W. and Calder, Bobby J. (1979), "Evaluating Consumer Protection Programs: Part I. Weak but Commonly Used Research Designs," Journal of Consumer Affairs, Winter, 13, p. 157-185.

Scriven, M. (1967), "The Methodology of Evaluation," in R. E. Stake (ed.), AERA Monograph Series on Curriculum Evaluation, 1, p. 39-83.

Shoemaker, Floyd R. (1979), "Diffusion and the Marketing Of New Products," presentation handout, University of Denver, Denver, Colorado, February 15, 1979.

Van Maanen, John (1979), "The Process of Program Evaluation," The Grantsmanship Center News, January/February, p. 29-74.
