A Review of Meta-Analytic Techniques

David Brinberg, Baruch College, CUNY James Jaccard, SUNY - Albany
ABSTRACT - A number of statistical procedures have been developed in recent years to integrate a set of empirical findings. This paper provides a review and synthesis of the strengths and weaknesses of these techniques.
[ to cite ]:
David Brinberg, Baruch College, CUNY James Jaccard (1986) ,"A Review of Meta-Analytic Techniques", in NA - Advances in Consumer Research Volume 13, eds. Richard J. Lutz, Provo, UT : Association for Consumer Research, Pages: 606-611.

Advances in Consumer Research Volume 13, 1986      Pages 606-611

[This paper is a condensed version of a chapter entitled "Meta-Analysis Techniques for the quantitative integration of research findings" to appear in D. Brinberg & R.J. Lutz (Eds.) Perspectives on Methodology in Consumer Research, New York: Springer-Verlag.]

INTRODUCTION

With the rapid growth of empirical findings in consumer behavior, there is a need to integrate research results into a coherent body of knowledge. In this paper, we will discuss an approach that can provide the researcher with the tools to integrate a body of literature. The paper will be divided into two sections. In section one, we will present a brief historical overview of the procedures used to integrate research findings. In section two, we will discuss the types of statistical procedures that can be used for the integration of research findings and we will present examples of the use of these procedures.

Integrating Research Findings: A Brief Historical Review

Statistical procedures for the integration of research findings have been available to social science researchers for over 50 years (cf. Pearson 1933; Pearson 1938; Wallis 1942; Fisher 1948; Mosteller & Bush 1954). Those procedures focused primarily on the development of transformations (e.g., natural log, z score transformation) and strategies for weighting individual studies, and they allow the researcher to combine or to compare the probability levels obtained from independent studies. Recent work on these statistical procedures (e.g., Rosenthal 1978; Glass, McGaw and Smith 1981; Hunter, Schmidt, and Jackson 1982; Hedges 1982, 1983) has extended the work on transformations and has focused on the estimation of the nature and strength of a hypothesized relation. Meta-analysis has been the term developed to describe the use of these statistical procedures for the integration of a set of empirical findings across independent studies (Glass, McGaw and Smith 1981).

In recent years, meta-analytic procedures have been introduced to consumer researchers. Sawyer and Peter (1983) discuss issues associated with meta-analysis in their presentation of statistical significance testing. Substantive applications of meta-analysis have examined the relation between research design and response rates to questionnaires (Yu and Cooper 1983); research design and the reliability of rating scales (Churchill and Peter 1983); and advertising and sales (Assmus, Farley, and Lehmann 1984). Several researchers have discussed the strengths and weaknesses of meta-analysis (Houston, Peter, and Sawyer 1983; Monroe and Krishnan 1983), although those authors do not provide a discussion of the analytic procedures. In section two of this paper, we will focus on a detailed discussion of these procedures.

Limitations of Literature Reviews

Two types of reviews will be discussed: (1) narrative reviews and (2) meta-analytic reviews. In our view, the primary feature that distinguishes narrative from meta-analytic reviews is that the latter use precise and formal procedures whereas the former use the reviewers' subjective judgments to address the following set of questions: (1) Is there a relationship between two (or more) variables (e.g., consumer attitudes and purchase behavior)? (2) What is the strength of that relationship? and (3) Under what conditions (e.g., subjects, behaviors, measures, and contexts) will the findings hold?

Narrative reviews can be characterized as an integration of findings by either itemizing each result or by some vote-counting technique (i.e., by tabulating the number of studies consistent with a particular hypothesis). Proponents of meta-analysis have described narrative reviews as limited in their potential for useful information. For instance, Axelson, Federline and Brinberg (1985) conducted a meta-analysis on the relation between food attitudes and food related behavior. In their review, nine studies were found that examined this relation. Seven of those nine studies had a nonsignificant correlation between attitude and behavior. Using a vote counting procedure, these authors would have concluded that no relation exists between food attitudes and food related behavior. Using meta-analytic procedures, however, these authors found a statistically significant (though small) relation between food attitude and food behavior. Thus, the meta-analytic procedures reduced the likelihood of these authors making a Type II error.

There are several limitations associated with the narrative approach. First, when the number of studies is large, the reviewer may have difficulty estimating (a) whether there is a relationship among the variables being examined and (b) the strength and nature of that relationship. Second, there are losses of information that may occur if information is integrated incorrectly (Light and Smith 1971): (1) weak inferences, that is, the relation under study is stronger than the reviewer infers; (2) overlooked inferences, that is, the reviewer ignores systematic variation in the set of findings; and (3) wrong inferences, that is, the reviewer integrates the set of findings in a manner that is inconsistent with the more accurate statistical integration. Third, reviewers may ignore the impact of statistical power when using just the level of significance in their integration of a set of findings (Cook and Leviton 1980). A finding may not reach statistical significance because of a lack of power but still be consistent with the direction of the hypothesized relation. Ignoring that finding or treating it as nonsupport for the hypothesis may result in a misleading conclusion, as illustrated above.

Several researchers have presented limitations that apply to both narrative and meta-analytic reviews, although the discussion by these authors focuses on meta-analytic techniques. Wilson and Rachman (1983) and Mintz (1983) have argued that bias can occur in the selection of studies both in terms of the specification of the sampling frame (e.g., only published studies or only those studies with quantified results) as well as in the specification of hypothesized relations. For instance, studies that examine the attitude-behavior relation may measure attitude several ways (e.g., semantic differential, set of belief statements). The reviewer needs to make a judgment concerning what studies are appropriate for inclusion in the literature review; that is, he/she needs to determine what studies will be treated as replicates (and thus included in the review) and what studies will be treated as different (and thus excluded from the review).

Another major source of controversy in conducting reviews is the integration of seemingly unrelated studies. For instance, Glass, McGaw and Smith (1981) report a meta-analysis in which they combined a wide variety of psychotherapy treatments and a wide variety of illnesses. A single overall effect size estimate concerning the effectiveness of psychotherapy was computed. That estimate, however, can be misleading because it may mask ineffective treatments. These authors point out that useful information concerning psychotherapy can be gained when the set of studies is partitioned into different groups and the scope and limits of treatments can be identified.

For most research, we assume those units we treat as a replication factor are homogeneous (e.g., subjects in a between-subject factorial design). The reviewer makes a similar assumption concerning the set of studies to be integrated because the studies are treated as the replication factor. Procedures (to be discussed in section two) are available to determine the uniformity of these studies and to test the viability of this assumption.

The impact of poor quality studies is another source of controversy when conducting literature reviews. Glass and Kliegl (1983) argue that all studies should be included in a review and that the quality of the study should be coded and correlated with outcome variables to determine their relation. If there is a non-significant correlation, the researcher might then conclude that study quality is unrelated to study outcome. Mintz (1983) and Wilson and Rachman (1983), however, argue that including poor studies provides no useful information because no accurate inferences may be made concerning their outcomes. Further, Mintz (1983) argues that coding the quality of the study assumes the researcher is able to develop a valid coding system. Mintz suggests that the reviewer not include all studies but make explicit his/her reasons for excluding any particular study from the review.

A final source of controversy concerns the treatment of studies as independent replicates of a hypothesized relation. If a single study includes more than one piece of information concerning the hypothesized relation and each piece of information is included separately in the review, the researcher is faced with the problem of biasing the integration of the set of studies by placing undue weight on findings from the same study and violating statistical assumptions of independence. Two approaches have been suggested to deal with this problem (Light and Pillemer, 1984). First, the researcher may develop a single statistic for that study. Second, the researcher can include each piece of information in the review but weight the information in such a way that it receives the same overall weight as the findings from a single study. We do not recommend the use of the second approach because it violates the assumption of independence, although some techniques have been developed recently to adjust for nonindependent findings (Strube 1985).

The Role of Meta-Analysis in the Conduct of Research

Our position is that meta-analytic procedures should be viewed as one strategy within a general research program for studying a focal problem. The primary advantages of the meta-analytic procedures are two-fold: (1) statistical procedures are more precise than intuition in identifying the presence, strength and nature of a hypothesized relation and (2) these procedures can be used to identify the scope and limits of the hypothesized relation.

Several sources of controversy were raised concerning literature reviews which we would like to address. First, in our view, the researcher conducting a meta-analysis is faced with many of the same "subjective" choices as the narrative reviewer (e.g., specification of the sampling frame, the hypothesized relation, the quality of the study, factors that potentially interact with the study outcomes, the scope and limits of the findings). We view both procedures as subjective, although we acknowledge that the meta-analytic procedures help the researcher be more explicit in his/her assumptions and be more precise in the integration of probability levels and effect size estimates. Second, the issue concerning the integration of unrelated studies is a question concerning the conceptual clarity of the original hypotheses. A researcher may be interested in only combining studies that operationalize a concept similarly or may be interested in examining all studies that claim to examine the same concept (even if they are operationalized differently). There is no single approach that is correct because the selection of studies depends on the purposes of the reviewer. Each reviewer needs to choose and make explicit the criteria he/she has used to select a set of studies. Third, poor quality studies (i.e., those studies with design flaws or poor execution that result in confounded findings) should not be included in an integration of a set of findings because the outcomes of those studies are unclear and will add uninterpretable variation in a review.
We have argued to exclude poor quality studies for two reasons: (1) poor quality studies have confounded (i.e., ambiguous) outcomes; thus, no reasonable inference can be made concerning the relation between those studies and other "external" variables, and (2) if the number of studies is sufficiently large to examine the difference between poor and good quality studies, then the researcher is likely to have a sufficient sample size to simply use the good quality studies to examine the hypothesized relation. Fourth, to reduce the problems that occur when including multiple findings from a single study, we recommend that only a single finding from any one study be included in a review. A conservative approach (in terms of Type I errors) is to select the weakest relationship, and a liberal approach is to select the strongest relationship.

The major source of information in any literature review is not in single summary statements concerning the relation but in identifying its scope and limits. Knowing the overall significance level and the overall effect size of a set of studies provides the reviewer with limited information. Only when the reviewer is able to fit it into some nomological network will he/she come to a deeper understanding of that relation.

Statistical techniques have been developed to estimate the impact of partition variables on the presence or absence of a hypothesized relation, across a set of independent studies, and on the strength and nature of that relation. A detailed, and somewhat formal presentation of these techniques will be the topic of discussion for the remainder of this paper.

QUANTITATIVE PROCEDURES

This section describes statistical methods for analyzing multiple research studies. We begin by considering the characteristics of effect size measures used in most meta-analyses. After describing the statistical focus of meta-analytic techniques, we consider approaches to the quantitative analysis of multiple studies.

Measures of Effect Size

Numerous indices of effect size have been proposed in the social sciences. One of the more popular indices has focused on the proportion of explained variance in the dependent variable relative to the independent variable. Statistics such as t, r, F (where df = 1), and chi square (where df = 1) can be converted to measures reflecting the proportion of explained variance (EV) in a set of data as follows:

$$EV = r^2 \qquad (1)$$

$$EV = \chi^2 / N \qquad (2)$$

$$EV = F / (F + df_{within}) \qquad (3)$$

$$EV = t^2 / (t^2 + df) \qquad (4)$$
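Equations 1 through 4 can be sketched in a few lines of Python. This is an illustration only and not part of the original paper: the function names are hypothetical, equation 2 is read here as chi square divided by N, and the input values are made up.

```python
def ev_from_r(r):
    """Equation 1: EV = r^2."""
    return r ** 2


def ev_from_chi_square(chi_square, n):
    """Equation 2 (as read here): EV = chi square / N, for chi square with df = 1."""
    return chi_square / n


def ev_from_f(f, df_within):
    """Equation 3: EV = F / (F + df within), for F with 1 numerator df."""
    return f / (f + df_within)


def ev_from_t(t, df):
    """Equation 4: EV = t^2 / (t^2 + df)."""
    return t ** 2 / (t ** 2 + df)


# Hypothetical study: t = 2.0 on 28 df. The same result reported as
# F = t^2 = 4.0 gives the same proportion of explained variance.
print(ev_from_t(2.0, 28))   # 4 / 32 = 0.125
print(ev_from_f(4.0, 28))   # 4 / 32 = 0.125
```

Because F with one numerator df equals t squared, equations 3 and 4 agree for the same study, which is a quick consistency check when coding results from different reports.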

Meta-analytic techniques typically use two indices of effect size. The approach advocated by Rosenthal (1983) uses an index, rR, which is the square root of any of the EV measures. The second index is based upon Glass et al. (1981) and Cohen (1977) and concerns the case of mean scores in two conditions. This index, called d, is defined as

$$d = (M_1 - M_2)/s \qquad (5)$$

where s is a pooled standard deviation from the two conditions. Cohen (1977) suggests that a d of .2 represents a "weak" relationship, a d of .5 represents a "moderate" relationship, and a d of .8 represents a "strong" relationship (however, see Cohen 1977, p. 24 for additional perspectives on this problem). d is mathematically related to rR (e.g., d values of .2, .5, and .8 equal rR values of .10, .25, and .37 respectively; see Cohen (1977) for the mathematical translation of d to rR and vice versa).
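The translation between d and an r-type index can be sketched as follows. The formulas used are the common equal-group-size conversions, r = d / sqrt(d^2 + 4) and its inverse; treating them as the translation the text has in mind is an assumption, and the function names are hypothetical.

```python
import math


def d_to_r(d):
    """Convert d to an r-type index, assuming equal n in the two groups."""
    return d / math.sqrt(d ** 2 + 4.0)


def r_to_d(r):
    """Inverse conversion, under the same equal-n assumption."""
    return 2.0 * r / math.sqrt(1.0 - r ** 2)


# Cohen's benchmark values of d.
for d in (0.2, 0.5, 0.8):
    print(d, round(d_to_r(d), 3))
```

Under this assumption the three benchmarks come out close to the r values quoted in the text; small differences in the second decimal place can arise from rounding or from unequal group sizes.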

In equation 5, there is some controversy as to the optimal method for defining s in sample data. The value of $\sigma$ in the population parameter $\delta = (\mu_1 - \mu_2)/\sigma$ is assumed to stem from populations with equal variances. It follows that the best estimate of $\sigma$ will be based upon traditional formulas used in t tests. In cases where the two groups in question represent a treatment and control group, Glass, McGaw and Smith (1981) recommend the use of the control group standard deviation as the estimate of $\sigma$. In contrast, Hunter, Schmidt, and Jackson (1982) advocate the traditional pooled estimate of $\sigma$. For most consumer applications, the pooled estimate of $\sigma$ will be the most appropriate. Discussions in the sections that follow will therefore assume an s based upon a pooled standard deviation estimate. For further consideration of this issue, see Glass et al. (1981), Hunter et al. (1982), and Hedges (1981).

Few studies report the information necessary to compute d directly. However, it can be derived from the t statistic by EQUATION.
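The conversion formula itself is not reproduced in this copy of the paper. A minimal sketch, assuming an independent-groups t test with a pooled standard deviation, uses the identity d = t * sqrt(1/n1 + 1/n2), which follows directly from the definition of t. It may not be the exact form the authors printed, and the function name is hypothetical.

```python
import math


def d_from_t(t, n1, n2):
    """Recover d from an independent-groups t statistic.

    Assumes t = (M1 - M2) / (s * sqrt(1/n1 + 1/n2)) with pooled s,
    so that d = (M1 - M2) / s = t * sqrt(1/n1 + 1/n2).
    """
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Hypothetical study: t = 2.0 with 8 subjects per group.
print(d_from_t(2.0, 8, 8))   # sqrt(1/8 + 1/8) = 0.5, so d = 1.0
```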

Analyzing p values

Methods for estimating the statistical significance of a relationship across independent studies as a function of p values have been reviewed by Rosenthal (1978) and his associates (Rosenthal and Rubin, 1979). The two most popular methods are the method of adding z's and the method of adding weighted z's (also called Stouffer's method). Table 1 presents five hypothetical studies that examined the relationship between an independent and dependent variable by means of a t test. Column 2 of the table presents the t values observed in each study and column 3 indicates the direction of the mean difference (+ means M1 > M2, - means M1 < M2). Column 4 reports the df on which each t was based. Column 5 is the one tailed p associated with each t reported in column 2. A one tailed p is always used in this approach by virtue of statistical considerations. Most statistics texts do not provide tables that readily provide these p values. A FORTRAN program for computing the p associated with a t, r, F, or chi square statistic is presented in Veldman (1968, p. 222). Extensive tables of p values are provided by Sherman (1984). Column 6 of Table 1 presents the z score equivalents of the p values in question. These are obtained from tables of the standard normal distribution.

TABLE 1

NUMERICAL EXAMPLES FOR ANALYZING p VALUES

The method of adding z's involves summing the scores in column 6 and dividing by the square root of k, the number of studies. This calculation yields a z of 2.39. The one tailed p value associated with a z score of 2.39 is obtained from a table of the standard normal distribution and in this case, p = .009. For an alpha level of .05, the null hypothesis of no effect or no relationship would be rejected because .009 < .05. Note that if the vote counting procedure was applied to these studies, the result would be a failure to reject the null hypothesis.
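The unweighted method can be sketched as follows. The sketch uses the standard form in which the sum of the z's is divided by the square root of k, which is also the form consistent with equation 7. The z scores are hypothetical (Table 1's values are not reproduced in this copy), and the function names are illustrative.

```python
import math


def one_tailed_p(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def add_zs(z_scores):
    """Method of adding z's: sum of the z scores divided by sqrt(k)."""
    k = len(z_scores)
    return sum(z_scores) / math.sqrt(k)


# Hypothetical z scores for five studies; negative values mark results
# in the direction opposite to the hypothesis.
zs = [1.5, 0.8, 1.2, -0.3, 1.9]
z_combined = add_zs(zs)
print(round(z_combined, 2), round(one_tailed_p(z_combined), 4))
```

Using `math.erfc` avoids the need for a printed normal table; the combined z is compared with the chosen alpha level exactly as in the text.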

The method of adding weighted z's is similar to the above but each z in column 6 is weighted by the degrees of freedom. This approach gives greater weighting to studies with large N. The critical z uses columns 4 and 6 of Table 1 such that

EQUATION    (6)

In this case, z' = 3.01 with an associated p of .0013. Again, the null hypothesis is rejected. Note that the method of adding unweighted z's is more conservative than the method of adding weighted z's. This results from the fact that studies with smaller N typically will observe larger p values, and these are weighted equally in the unweighted case.
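Equation 6 is not reproduced in this copy. A minimal sketch of the weighted method, assuming the commonly given form in which each z is multiplied by its weight and the sum is divided by the square root of the sum of squared weights, is shown here. Whether this matches the printed equation term for term is an assumption, and the data are hypothetical.

```python
import math


def add_weighted_zs(z_scores, weights):
    """Weighted combination: sum(w * z) / sqrt(sum(w^2)).

    The weights would be the df of each study (or N for chi square).
    """
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w ** 2 for w in weights))
    return numerator / denominator


# Hypothetical z scores and degrees of freedom for five studies.
zs = [1.5, 0.8, 1.2, -0.3, 1.9]
dfs = [40, 18, 60, 12, 100]
print(round(add_weighted_zs(zs, dfs), 2))
```

With equal weights this reduces to the unweighted method, so the two procedures differ only to the extent that study sizes differ.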

When analysis focuses on p values associated with chi square, p values are, by definition, two tailed. Column 5 of Table 1 must therefore be derived by dividing the p value in half. Also, for purposes of executing the weighted z method, the weights are the respective sample sizes (N) rather than df.

One problem with the above meta-analysis procedures is the potential bias introduced by editorial policies of journals. It is commonly believed that statistically nonsignificant results are less likely to be published, the result being a tendency to underestimate the true value of p. Rosenthal (1980) has referred to this as the "file drawer problem," because numerous unpublished studies with large p values may be tucked away in the researchers' file drawers. One method for addressing this issue is to calculate the number of studies that would have to exist whose mean z is zero such that the observed z' would become marginally non-significant at the .05 alpha level. This can be calculated by

$$NS = \frac{\left(\sum_j z_j\right)^2}{2.706} - k \qquad (7)$$

where NS is the number of studies. As an example, Rosenthal estimated that approximately 3,444 studies with a mean z of zero would have to exist in file drawers for doubt to be cast on the existence of experimenter bias effects. In the current example the number of such studies would be six.
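Equation 7 can be checked against the running example. The sketch below takes the sum of the z scores and k as inputs; the function name is hypothetical. The constant 2.706 is the square of 1.645, the one tailed critical z at the .05 level. Since the text reports a combined z of 2.39 for k = 5 studies, the sum of the z's is recovered as 2.39 times the square root of 5.

```python
import math


def fail_safe_n(sum_z, k, z_crit_squared=2.706):
    """Equation 7: number of mean-zero studies needed to bring the
    combined result down to marginal non-significance at alpha = .05."""
    return sum_z ** 2 / z_crit_squared - k


# Recover the sum of z's from the reported combined z of 2.39 (k = 5).
sum_z = 2.39 * math.sqrt(5)
print(round(fail_safe_n(sum_z, 5), 2))   # about 5.55, i.e. six studies
```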

The above approach assumes that all p values are generated from a common population. This assumption may not be correct in certain instances, in which case it is inappropriate to combine p values. If the individual p values in the studies are quite heterogeneous (holding constant df) then the assumption of a common population is questionable. The procedures outlined in the next section provide a mechanism for testing this assumption.

Analyzing Effect Size Indices

The analysis of effect size indices across studies typically involves a two stage analysis. Given k independent investigations, the first stage is to determine if the population effect sizes across the k studies are equal. If the effect sizes are homogeneous, then the second stage focuses on estimating the size of the effect, based on the k investigations. If the effect sizes are heterogeneous, then the second stage focuses on isolating the sources of heterogeneity.

Stage 1: Testing Homogeneity of Effect Sizes

Many approaches have been suggested for testing homogeneity of effect sizes. We will focus discussion on the analysis of the d statistic and correlation coefficients, emphasizing approaches that, at present, seem to have the most desirable statistical properties.

Analysis of Mean Differences. Hedges (1981, 1982, 1983) has developed an approach that is applicable to the case of independent means. The technique uses the effect size measure d. Table 2 presents an example with six studies that examine the difference between two means on a common dependent measure. The t was converted to a d statistic by EQUATION. The first step involves translating each sample d to an unbiased estimate of $\delta$ by multiplying the d by $1 - 3/(4m - 1)$, where $m = n_1 + n_2 - 2$. This has been done in column 5 of Table 2. Let $t_j$ be the unbiased estimate of $\delta$ in study "j". The intermediate statistic, V, is computed for each study such that

EQUATION     (8)

Column 7 of Table 2 presents the corresponding V. Then,

EQUATION    (9)

and TS is approximately distributed as a chi square with k - 1 degrees of freedom. For the present example, TS = 14.212 - (8.889)^2/217.79 = 13.84. The null hypothesis of homogeneous effect sizes is rejected.
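Equations 8 and 9 are not reproduced in this copy, so the following is a sketch under stated assumptions rather than the authors' exact formulas. It assumes that V is the reciprocal of the large-sample variance of the corrected effect size, and that TS has the form implied by the numerical computation in the text: the sum of V times t squared, minus the squared sum of V times t divided by the sum of V. Function and variable names are hypothetical.

```python
def hedges_homogeneity(d_values, n1_values, n2_values):
    """Return (TS, corrected effect sizes t_j, weights V_j).

    Assumed forms (not confirmed by the source text):
      t_j   = d_j * (1 - 3 / (4m - 1)),  m = n1 + n2 - 2
      var_j = (n1 + n2) / (n1 * n2) + t_j^2 / (2 * (n1 + n2))
      V_j   = 1 / var_j
      TS    = sum(V t^2) - (sum(V t))^2 / sum(V)
    """
    ts, vs = [], []
    for d, n1, n2 in zip(d_values, n1_values, n2_values):
        m = n1 + n2 - 2
        t = d * (1.0 - 3.0 / (4.0 * m - 1.0))   # small-sample bias correction
        var = (n1 + n2) / (n1 * n2) + t ** 2 / (2.0 * (n1 + n2))
        ts.append(t)
        vs.append(1.0 / var)
    sum_v = sum(vs)
    sum_vt = sum(v * t for v, t in zip(vs, ts))
    sum_vtt = sum(v * t * t for v, t in zip(vs, ts))
    return sum_vtt - sum_vt ** 2 / sum_v, ts, vs


# Hypothetical data for three studies (not the values of Table 2).
stat, ts, vs = hedges_homogeneity([0.10, 0.45, 0.80], [30, 25, 40], [30, 25, 40])
print(round(stat, 2))   # compare with the chi square critical value, df = k - 1
```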

TABLE 2

NUMERICAL EXAMPLE FOR ANALYSIS OF HOMOGENEITY OF d SCORES

Correlation Coefficient. Rosenthal (1983) has described an approach which can be used to test the homogeneity of a set of correlation coefficients. Table 3 presents an example with six studies that report the correlation between two variables. r is first converted to Fisher's Z using the transformation Z = .50 (ln (1 + r) - ln (1 - r)), where ln = the natural logarithm. Most statistics texts provide tables for converting r to Z. The relevant Z appear in column 4 of Table 3. An average Z is computed as follows:

EQUATION     (10)

where Nj = the sample size for study "j". TS is calculated by

EQUATION     (11)

TABLE 3

NUMERICAL EXAMPLE FOR THE ANALYSIS OF HOMOGENEITY OF CORRELATIONS

For the present data, the average Z = 102.91/282 = .365 and TS = 2.24. TS is approximately distributed as a chi square with k - 1 degrees of freedom. For a chi square of 6 - 1 = 5 degrees of freedom, the critical value of chi square for alpha = .05 is 11.07. Because 2.24 does not exceed this value, the null hypothesis of homogeneous correlations is not rejected.
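Equations 10 and 11 are not reproduced in this copy. The sketch below assumes the usual form of this test, in which each Fisher Z is weighted by N - 3, the average Z is the weighted mean, and TS is the weighted sum of squared deviations from that mean. The data and function name are hypothetical.

```python
import math


def correlation_homogeneity(r_values, n_values):
    """Return (average Z, TS) for a set of independent correlations.

    Assumed forms: weights N_j - 3, average Z = sum(w Z) / sum(w),
    TS = sum(w * (Z - average Z)^2), referred to chi square with k - 1 df.
    """
    zs = [0.5 * (math.log(1 + r) - math.log(1 - r)) for r in r_values]
    weights = [n - 3 for n in n_values]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    stat = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    return z_bar, stat


# Hypothetical correlations and sample sizes (not the values of Table 3).
z_bar, stat = correlation_homogeneity([0.30, 0.42, 0.25, 0.38], [50, 60, 45, 55])
print(round(z_bar, 3), round(stat, 2))
```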

Measurement Error. The above techniques do not take into account fallible measures. Hedges proposes modifications to his formulas to incorporate the effects of measurement error when good reliability estimates are available. Hedges' modification involves multiplying the pooled standard deviation estimate, s, in equation 5 by the square root of the reliability coefficient for the dependent variable. The adjusted t scores are then analyzed instead of the observed d scores using formulas 9 and 10. Because s is rarely provided in research reports, a simpler approach is to multiply the observed value of d by EQUATION, where ryy is the reliability coefficient for the dependent variable.

For the analysis of correlations, the original r's are multiplied by EQUATION. Equation 9 is then applied to the adjusted r.
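The adjustment for d described above can be sketched as follows. Because s is the denominator of equation 5, multiplying s by the square root of the reliability coefficient is equivalent to dividing d by that square root. The multiplier printed in the original is not reproduced in this copy, so this equivalence is the working assumption here, and the function name is hypothetical.

```python
import math


def adjust_d_for_reliability(d, reliability):
    """Adjust d for unreliability in the dependent variable.

    Assumes the adjustment d / sqrt(reliability), which follows from
    multiplying s in equation 5 by sqrt(reliability).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be greater than 0 and at most 1")
    return d / math.sqrt(reliability)


# Hypothetical observed d of .40 with an estimated reliability of .64.
print(adjust_d_for_reliability(0.40, 0.64))   # .40 / .80 = 0.5
```

The adjusted value is always at least as large in absolute value as the observed d, so a lower-bound reliability estimate gives the largest plausible adjustment.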

The problem with the above approaches is that they assume accurate estimates of reliability are available. This is rarely the case in consumer research. The most frequently used reliability estimate, test-retest correlation, is subject to considerable sampling error and, at best, represents a crude approximation to reliability. This creates a dilemma. On the one hand, the analyst can use the crude reliability estimate with the knowledge that sampling error in the estimate could undermine the validity of the conclusions made from the analysis. On the other hand, the analyst can ignore the reliability adjustments, which is tantamount to assuming perfectly reliable measures.

We recommend a two step approach in which the test is performed first without the reliability adjustment and then with the reliability adjustment. The reliability estimates used should be a best-guess (based upon theory, past research, and present reliability data) of the lower bound of the reliability of the measures in question. If the results of both analyses lead to the same conclusion, then one has increased confidence in that conclusion. However, if the results of the analysis are discrepant, then any conclusions must be tentative.

Stage 2: The Case of Homogeneous Effect Sizes

If one is confident that homogeneous effect sizes exist, then it is reasonable to average the effect sizes in some manner to estimate the population effect size.

Mean Differences. Hedges has suggested several methods for estimating $\delta$. We will describe the approach advocated in Hedges (1982). The estimate of $\delta$, d', is calculated from

EQUATION    (12)

where all terms are as previously defined. Approximate 95% confidence intervals can be formed by first defining the constant EQUATION; and then forming the lower limit by d' - c, and the upper limit by d' + c.
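Equation 12 and the constant c are not reproduced in this copy. A minimal sketch, assuming the inverse-variance weighted mean d' = sum(V t) / sum(V) and the half-width c = 1.96 / sqrt(sum(V)), is given below. Both forms are assumptions about the missing equations, and the inputs are hypothetical.

```python
import math


def pooled_effect_size(t_values, v_values):
    """Return (d', lower limit, upper limit) for homogeneous effect sizes.

    Assumed forms: d' = sum(V t) / sum(V); c = 1.96 / sqrt(sum(V)).
    """
    sum_v = sum(v_values)
    d_prime = sum(v * t for v, t in zip(v_values, t_values)) / sum_v
    c = 1.96 / math.sqrt(sum_v)
    return d_prime, d_prime - c, d_prime + c


# Hypothetical corrected effect sizes t_j and weights V_j for three studies.
estimate, lower, upper = pooled_effect_size([0.30, 0.42, 0.35], [14.8, 12.2, 19.6])
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```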

Correlation Coefficients. Viana (1988) has described a method for estimating $\rho$ from a set of k correlation coefficients across k studies. Table 4 presents five studies reporting correlation coefficients that will serve as a numerical example. Column 2 presents the sample size for each study and column 3 presents the observed sample r. Column 4 presents an approximation to the minimum variance unbiased estimate of $\rho$, $\hat{r}$ (Olkin and Pratt, 1958). $\hat{r}$ is computed from r by $\hat{r}_j = r_j + r_j (1 - r_j^2)/(2 (N_j - 4))$. The estimate of $\rho$, r', is then $\sum_j \hat{r}_j (N_j - 1)/(TN - k)$, where $TN = \sum_j N_j$. For the data in Table 4, r' = .312.
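The estimate r' can be sketched as follows. The adjustment uses the standard Olkin and Pratt approximation, r + r(1 - r^2) / (2(N - 4)); the exponent on r inside the parentheses is an assumption, since it does not appear legibly in every copy of the formula. The data and function name are hypothetical.

```python
def pooled_correlation(r_values, n_values):
    """Estimate the population correlation from k independent studies.

    Each r is first adjusted by the Olkin-Pratt approximation, then the
    adjusted values are averaged with weights (N_j - 1) / (TN - k).
    """
    k = len(r_values)
    total_n = sum(n_values)
    adjusted = [
        r + r * (1 - r ** 2) / (2 * (n - 4))
        for r, n in zip(r_values, n_values)
    ]
    return sum(a * (n - 1) for a, n in zip(adjusted, n_values)) / (total_n - k)


# Hypothetical sample sizes and correlations (not the values of Table 4).
print(round(pooled_correlation([0.25, 0.40, 0.31], [40, 64, 52]), 3))
```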

An approximate 95% confidence interval can be constructed for cases where $\rho$ < .70, a common situation in consumer research. This involves transforming each observed sample r to Fisher's Z and then calculating a weight, w, for each Z. w is defined as follows:

$$w_j = \frac{(N_j - 3)(1 + .5(N_j - 1))}{(TN - 3k)(1 + .5(N_j - 1))} \qquad (13)$$

The approximate 95% confidence interval is constructed by defining the two constants $c = 1.96 \left(\sum_j w_j^2/(N_j - 3)\right)$ and $f = 1 + \sum_j \left(w_j/(2(N_j - 1))\right)$. The lower limit of the interval is $\left(\left(\sum_j w_j Z_j\right) - c\right)/f$ and the upper limit of the interval is $\left(\left(\sum_j w_j Z_j\right) + c\right)/f$. This yields values of .088 and .520 for the data in Table 4.

TABLE 4

NUMERICAL EXAMPLE FOR ESTIMATING THE POPULATION CORRELATION

Measurement Error. Estimates calculated in the above techniques can be adjusted for measurement error using procedures discussed earlier. This involves calculating an adjusted d or an adjusted r and applying the formulas to the adjusted values rather than the observed values.

Biased Estimates. The validity of all estimation procedures is subject to editorial policies of journals. To the extent that journals only accept articles reporting "large" effect sizes and reject articles reporting "weak" effect sizes (and which are methodologically sound), then estimates will be biased upward. This possibility should be considered when interpreting effect size estimates. Recent perspectives on this problem and methods for addressing it are discussed in Hedges (1984).

Orwin (1983) has developed an analogy to Rosenthal's "file drawer" formula for measures of d. One can compute the number of studies ("fail-safe" n) whose mean d is zero that would have to exist in order to reduce the sample mean d to some specified value (e.g., .2, corresponding to a "weak" effect in Cohen's analysis; .5, a "medium" effect; or .8, a "strong" effect). The estimated number of studies is derived by

$$NS = \frac{k (d - d_c)}{d_c} \qquad (14)$$

where NS = the number of studies in the "file drawer," k = the number of studies on which the mean sample effect size is based, d = the absolute value of the unweighted mean of the observed d scores across the k studies, and dc = the criterion value (e.g., .2, .5, .8).

There are several limitations to equation 14. First, the criterion value dc must always be greater than zero. At best, the investigator can compute the "fail-safe" n for some trivial criterion effect size (e.g., .01) but not for an effect size of zero. Second, the formula assumes that the k studies have equal N. If the N are not very discrepant, then equation 14 should give a reasonable approximation to the "fail-safe" n. We therefore recommend equation 14 be used cautiously and as a rough guideline.
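Equation 14 and its first limitation can be sketched in a few lines. The function name and the input values are hypothetical; the guard on the criterion value reflects the restriction that it must be greater than zero.

```python
def orwin_fail_safe_n(k, mean_d, criterion_d):
    """Equation 14: NS = k * (d - dc) / dc.

    mean_d is the unweighted mean of the observed d scores (its absolute
    value is used); criterion_d is the value to which the mean would be
    reduced and must be greater than zero.
    """
    if criterion_d <= 0:
        raise ValueError("criterion_d must be greater than zero")
    return k * (abs(mean_d) - criterion_d) / criterion_d


# Hypothetical review: 10 studies with a mean d of .50. How many
# mean-zero studies would pull the average down to a "weak" effect of .20?
print(round(orwin_fail_safe_n(10, 0.50, 0.20), 1))   # 10 * .30 / .20 = 15.0
```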

A similar formula can be derived for correlation coefficients. The formula is

$$NS = \frac{k (Z_o - Z_c)}{Z_c} \qquad (15)$$

where Zo is the Fisher Z transform of the mean sample correlation across the k studies, Zc is the Fisher Z transform of the criterion correlation value, and NS and k are as previously defined. Equation 15 has the same limitations as equation 14.
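Equation 15 can be sketched in the same way. `math.atanh` computes the Fisher Z transform, .50 (ln (1 + r) - ln (1 - r)). The function name and inputs are hypothetical.

```python
import math


def fail_safe_n_correlation(k, mean_r, criterion_r):
    """Equation 15: NS = k * (Zo - Zc) / Zc, with Fisher Z transforms."""
    z_o = math.atanh(mean_r)        # Fisher Z of the mean sample correlation
    z_c = math.atanh(criterion_r)   # Fisher Z of the criterion correlation
    if z_c <= 0:
        raise ValueError("criterion_r must be greater than zero")
    return k * (z_o - z_c) / z_c


# Hypothetical review: 6 studies with a mean r of .35 and a criterion r of .10.
print(round(fail_safe_n_correlation(6, 0.35, 0.10), 1))
```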

Stage 2: The Case of Heterogeneous Effect Sizes

When the effect sizes are found to be heterogeneous, it is not appropriate to average them to estimate the relevant population parameter. Instead, additional analyses are required to determine which subsets of effect sizes are heterogeneous. This can be accomplished from either an exploratory perspective or a theory driven perspective. We will consider each in turn.

Exploratory Analyses. Hedges and Olkin (1983) discuss exploratory procedures for the analysis of d and the correlation coefficient, r. One procedure is based on the Bonferroni multiple comparison procedure developed in analysis of variance. We will first consider the case of d.

The procedure begins by rank ordering effect sizes for the k studies from smallest (rank = 1) to largest (rank = k). This has been done in column 2 of Table 5, which presents d measures (column 3) for five studies. The dj are first transformed to U statistics by , - 2 c, nw , where c = EQUATION - 1.0536 and u - (1/k ( , nw )). n refers to the number of subjects per group in a given study, assuming equal n in the two groups (the case of unequal n will be considered shortly). Column 4 presents the U measure for each d. The absolute difference between two effect sizes is tested against zero by comparing the absolute difference between corresponding U's with a critical value. The critical value is determined by specifying the number of "steps," ns, between the effect sizes. ns will equal the larger of the two ranks minus the smaller of the two ranks plus one. For example, to compare effect sizes for study F and X, ns = 5 - 3 + 1 = 3. The absolute difference between UF and UD is .69 and the critical value (alpha = .05) is 3.68. Because .69 does not exceed 3.68, we fail to reject the hypothesis of homogeneous effect sizes for these studies. Using this approach, exploratory comparisons can be made with the possibility that post hoc analyses of differences in effect size will make substantive sense when study characteristics defining "similar" and "different" effect sizes are considered.

TABLE 5

NUMERICAL EXAMPLE FOR ANALYSIS OF d MEASURES BASED ON HEDGES AND OLKIN

For correlation coefficients, the r is transformed to U by first transforming the sample r to Fisher's Z and then multiplying the Z by EQUATION.

Both approaches formally assume equal N across studies, although Hedges and Olkin suggest that unequal N will not drastically affect the outcome. Further, the analysis of d assumes that the two conditions within a study have equal n. In cases of unequal n within a study, a conservative approach is to define n by the smaller n of the two, a liberal approach is to define n by the larger n of the two, and a compromise is to define n as the mean n.
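The three ways of defining n for a study with unequal group sizes can be written out as a small helper. The function name, the rule labels, and the group sizes in the example are hypothetical and serve only to illustrate the choice.

```python
def per_group_n(n_1: int, n_2: int, rule: str = "compromise") -> float:
    """Define n for a study whose two conditions have unequal sample sizes."""
    if rule == "conservative":
        return float(min(n_1, n_2))   # smaller n of the two
    if rule == "liberal":
        return float(max(n_1, n_2))   # larger n of the two
    if rule == "compromise":
        return (n_1 + n_2) / 2.0      # mean n
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical group sizes, not taken from Table 5.
choices = {rule: per_group_n(20, 30, rule)
           for rule in ("conservative", "liberal", "compromise")}
```

For group sizes of 20 and 30 the three rules give 20, 30, and 25, respectively.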

Theoretical Analysis. Hedges (1982) discusses a useful approach for analyzing heterogeneous d measures when the measures can be grouped into clusters a priori on the basis of theoretical concerns. For the data in Table 5, assume the first three studies were conducted with upper class individuals whereas the second three studies were conducted with working class individuals. Using equation 9, the overall chi square, TS, is calculated to test the null hypothesis of homogeneous effect sizes across all k studies. The total chi square is then decomposed into two components, one due to between group differences (as defined by social class) and the other due to within group differences. This procedure is analogous to the partitioning of variance in analysis of variance problems. The chi square for between group differences is calculated as follows: Let p = the number of clusters (in this case p = 2, upper and working class), m_i = the number of studies in cluster i, t_ij = the unbiased estimated effect size of study j in cluster i, v_ij = the V measure for study j in cluster i, and t'_i = the average d measure in cluster i (note that the mathematical definition of t'_i is given shortly). Then, the between groups chi square, TSB, is

EQUATION    (16) and (17)

TSB is distributed as a chi square with p - 1 degrees of freedom. For the data in Table 5, t'_1 = -.137, t'_2 = .317, and TSB = 7.50. The chi square for within group differences is TSW = TS - TSB = 13.84 - 7.50 = 6.34. TSW is distributed as a chi square with k - p degrees of freedom.
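Because equations 16 and 17 are not reproduced here, the following is only a sketch of how such a partition is commonly computed. It assumes that the V measure is the sampling variance of each effect size and that each study is weighted by 1/v; both are assumptions, not the authors' stated formulas. The function names and the input values are hypothetical and are not the Table 5 data.

```python
def weighted_mean(effects, variances):
    """Inverse-variance weighted average effect size (assumed weighting)."""
    weights = [1.0 / v for v in variances]
    return sum(w * t for w, t in zip(weights, effects)) / sum(weights)


def homogeneity(effects, variances):
    """Weighted sum of squared deviations from the weighted mean."""
    center = weighted_mean(effects, variances)
    return sum((t - center) ** 2 / v for t, v in zip(effects, variances))


def partition(clusters):
    """clusters: list of (effects, variances) pairs, one per cluster.
    Returns (TS, TSB, TSW) under the inverse-variance assumption."""
    all_t = [t for effects, _ in clusters for t in effects]
    all_v = [v for _, variances in clusters for v in variances]
    grand = weighted_mean(all_t, all_v)
    ts = homogeneity(all_t, all_v)
    tsb = sum(
        sum(1.0 / v for v in variances)
        * (weighted_mean(effects, variances) - grand) ** 2
        for effects, variances in clusters
    )
    tsw = sum(homogeneity(effects, variances) for effects, variances in clusters)
    return ts, tsb, tsw


# Hypothetical effect sizes and variances for two clusters of three studies.
example = [([-0.30, -0.10, 0.00], [0.02, 0.02, 0.02]),
           ([0.20, 0.30, 0.45], [0.02, 0.02, 0.02])]
ts, tsb, tsw = partition(example)
```

Under this weighting the between and within components sum to the total, which is the property the text relies on when it obtains TSW by subtraction.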

The critical value for TSB (alpha = .05) is 3.84. Because 7.50 exceeds 3.84, the null hypothesis of homogeneous population d across clusters is rejected. It is useful to calculate 95% confidence intervals for each cluster's average d (see discussion of equation 12). For upper class studies, the interval is -.285 to .012. For working class studies, it is .028 to .606. Non-overlapping confidence intervals indicate statistically significant differences.
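The overlap check on the two reported intervals is simple enough to spell out. The sketch below uses only the interval endpoints quoted in the text; the function names are hypothetical, and the intervals themselves are taken as given rather than recomputed from equation 12.

```python
def intervals_overlap(a, b):
    """True when closed intervals a = (low, high) and b = (low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def midpoint(interval):
    """Center of an interval, which for a symmetric interval is the point estimate."""
    return (interval[0] + interval[1]) / 2.0


upper_class = (-0.285, 0.012)    # reported 95% interval, upper class studies
working_class = (0.028, 0.606)   # reported 95% interval, second cluster

overlap = intervals_overlap(upper_class, working_class)
centers = (midpoint(upper_class), midpoint(working_class))
```

The intervals do not overlap, and their midpoints agree with the reported cluster averages of -.137 and .317 to within rounding.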

The critical value for TSW (alpha = .05) is 9.49. Because 6.34 is less than 9.49, we fail to reject the null hypothesis of homogeneous effect sizes within clusters, indicating that the calculation of an average d for each cluster is meaningful. If TSW is statistically significant, then equation 11 can be applied to each cluster separately to determine which group(s) has heterogeneous effect sizes.
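The two decisions in this worked example can be retraced with the reported figures. The sketch below takes TS, TSB, and both critical values directly from the text; nothing is recomputed from Table 5, and the variable names are hypothetical.

```python
# Reported values from the worked example.
ts_total = 13.84
ts_between = 7.50
ts_within = ts_total - ts_between   # within group chi square by subtraction

# Critical values (alpha = .05) quoted in the text for each test.
critical_between = 3.84
critical_within = 9.49

clusters_differ = ts_between > critical_between          # reject homogeneity across clusters
heterogeneous_within = ts_within > critical_within       # reject homogeneity within clusters
```

Here the between group test rejects homogeneity while the within group test does not, so averaging d within each cluster is defensible and no further cluster-by-cluster analysis is needed.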

The above procedures hold for the case of d measures. Analogous procedures are applicable to the case of correlation coefficients. Each correlation coefficient is converted to Fisher's Z. The total chi square, TS, is computed using equation 11. The between group chi square involves calculating an average Z, Z_i., for each cluster separately using equation 10. Let N_i = the sum of the sample sizes across studies in cluster i, and Z.. = the average Z (using equation 10) across all studies. Then

EQUATION (18)

and TSW = TS - TSB. TSB is approximately distributed as a chi square with p - 1 degrees of freedom. TSW is approximately distributed as a chi square with k - p degrees of freedom. Significance tests are applied as above. Approximate 95% confidence intervals about each mean r can be calculated, as described earlier.
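Equations 10, 11, and 18 are not reproduced here, so the following is only a sketch of one common way to carry out the partition for correlations. It assumes each Fisher's Z is weighted by n - 3, the reciprocal of its large-sample variance; that weighting is an assumption and may differ from the authors' equations. The function names and the (r, n) pairs are hypothetical.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's Z transformation of a sample correlation."""
    return math.atanh(r)


def mean_z(studies):
    """Average Z for a list of (r, n) pairs, weighting by n - 3 (assumed)."""
    total_weight = sum(n - 3 for _, n in studies)
    return sum((n - 3) * fisher_z(r) for r, n in studies) / total_weight


def weighted_ss(studies, center):
    """Weighted sum of squared deviations of the Z's from a given center."""
    return sum((n - 3) * (fisher_z(r) - center) ** 2 for r, n in studies)


def partition_correlations(clusters):
    """clusters: list of lists of (r, n) pairs. Returns (TS, TSB, TSW)."""
    pooled = [study for cluster in clusters for study in cluster]
    grand = mean_z(pooled)
    ts = weighted_ss(pooled, grand)
    tsb = sum(sum(n - 3 for _, n in cluster) * (mean_z(cluster) - grand) ** 2
              for cluster in clusters)
    tsw = sum(weighted_ss(cluster, mean_z(cluster)) for cluster in clusters)
    return ts, tsb, tsw


# Hypothetical correlations and sample sizes for two clusters.
example_r = [[(0.10, 50), (0.20, 80), (0.15, 60)],
             [(0.40, 70), (0.35, 40), (0.50, 90)]]
ts_r, tsb_r, tsw_r = partition_correlations(example_r)
mean_r_by_cluster = [math.tanh(mean_z(cluster)) for cluster in example_r]
```

The average Z for each cluster is converted back to the r metric with the inverse transformation, as in the last line.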

SUMMARY AND CONCLUSIONS

In this paper, we have discussed two basic features in the use of meta-analytic techniques. First, we described the strengths and weaknesses of literature reviews in general and meta-analytic reviews in particular. Second, we discussed a set of statistical techniques that can be used for summarizing the findings across a set of independent studies. In that discussion, we described two basic summary statistics, a correlation coefficient between an independent and dependent variable and a d statistic (Glass, McGaw and Smith, 1981), and discussed the impact of unequal sample size and measurement error on those statistics. Researchers are currently examining the statistical properties of these meta-analytic techniques in great detail (e.g., Hedges, 1984).

In many areas of the social sciences, meta-analytic techniques are receiving increased use as a tool for the systematic integration of a body of literature. Our goal in this paper is two-fold: (1) to make the reader aware of current statistical advances in meta-analytic techniques and (2) to describe how meta-analytic techniques can be used to increase a researcher's understanding of the many facets that can influence a set of findings. Our hope is that readers will make greater use of these techniques in their own areas of interest.

REFERENCES

Assmus, G., Farley, J.U., and Lehmann, D.R. (1984). How advertising affects sales: A meta-analysis of econometric results. Journal of Marketing Research, 21, 65-74.

Axelson, M.L., Federline, T.L., and Brinberg, D. (1985). Food- and nutrition-related knowledge, attitudes, and behavior: A meta-analysis. Journal of Nutrition Education, 17(2), 51-54.

Churchill, G.A., and Peter, J.P. (1984). Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research, 21, 360-375.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Cook, T.D., and Leviton, L.C. (1980). Reviewing the literature: A comparison of traditional methods with meta-analysis. Journal of Personality, 48, 449-472.

Fisher, R.A. (1948). Combining independent tests of significance. American Statistician, 2, 30.

Glass, G.V., & Kliegl, R.M. (1983). An apology for research integration in the study of psychotherapy. Journal of Consulting and Clinical Psychology, 51, 28-41.

Glass, G.V., McGaw, B., & Smith, M.L. (1981). Meta-Analysis in Social Research. Beverly Hills, CA: Sage Publications.

Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimates. Journal of Educational Statistics, 6, 107-128.

Hedges, L.V. (1982). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92, 490-499.

Hedges, L.V. (1983). A random effects model for effect sizes. Psychological Bulletin, 93, 388-395.

Hedges, L.V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding nonsignificant mean differences. Journal of Educational Statistics, 9, 61-85.

Hedges, L.V., and Olkin, I. (1983). Clustering estimates of effect magnitude from independent studies. Psychological Bulletin, 93, 563-573.

Houston, M.J., Peter, J.P., and Sawyer, A.G. (1983). The role of meta-analysis in consumer research. In R.P. Bagozzi and A.M. Tybout (Eds.), Advances in Consumer Research, 10, 497-502.

Hunter, J.E., Schmidt, F.L., & Jackson, G.B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage Publications.

Light, R.J., and Smith, P.V. (1971). Accumulating evidence: Procedures for resolving contradictions among different studies. Harvard Educational Review, 41, 429-471.

Light, R.J., and Pillemer, D.B. (1984). Summing Up. Cambridge, Mass.: Harvard University Press.

"(For remaining references, please contact author.)"

----------------------------------------