Research Design Effects on the Reliability of Rating Scales in Marketing: an Update on Churchill and Peter

Elizabeth J. Wilson, Louisiana State University
ABSTRACT - A meta-analysis using the recently published Handbook of Marketing Scales is conducted to note similarities and differences in reliability estimate information as presented in a 1984 study by Gilbert Churchill and Paul Peter. A key research question is "how do research design characteristics affect the reliability (internal consistency) of a measurement scale?" Consistent with Churchill and Peter, we examine sampling characteristics, measure characteristics, and measure development characteristics to note whether any of these variables are systematically related to differences in the reliability of rating scales. Our results update and, to a large extent, reaffirm findings of Churchill and Peter even though almost ten years have passed and very few of the scales included in the Handbook overlap with those included in their meta-analysis. In general, we found that rating scales used in marketing have an average reliability of 0.81; this indicates that researchers seem to be successfully meeting the challenges of developing measurement tools with a relatively high degree of psychometric rigor.
[ to cite ]:
Elizabeth J. Wilson (1995), "Research Design Effects on the Reliability of Rating Scales in Marketing: an Update on Churchill and Peter", in NA - Advances in Consumer Research Volume 22, eds. Frank R. Kardes and Mita Sujan, Provo, UT: Association for Consumer Research, Pages: 360-365.


Churchill and Peter (1984) conducted a landmark study in which they used meta-analysis to determine whether particular research design variables (type of sample, type of subjects, use of reverse scoring, etc.) affected the overall reliability of rating scales in marketing. In other words, they studied the question, "what research design variables are likely to make one measurement scale more reliable than another?" If a researcher knows, for example, that a scale composed of semantic differential items with both numerical and verbal labels tends to have a higher reliability score, on average, than a scale using Likert-type items with labels on the polar points only, then that researcher can build in those aspects when designing a new measure.

The present study builds upon the work of Churchill and Peter (1984) to update and extend their findings. Bearden, Netemeyer, and Mobley's (1993) Handbook of Marketing Scales provides a set of scales and related developmental information for which a similar meta-analysis is conducted. Results of this study are compared to those from Churchill and Peter to note similarities and differences over time.

A meta-analytic review of Bearden et al. and comparison to findings of Churchill and Peter may be a useful contribution for three reasons. First, a programmatic update on this issue is needed since Churchill and Peter's findings are now almost ten years old. Second, new knowledge may be gained by this examination because there is relatively little overlap between the studies included in Churchill and Peter and Bearden et al. Churchill and Peter examined 154 scales (from 107 studies) and Bearden et al. examined 122 scales; only 15 of the scales are common to both. Third, Bearden et al. have a relatively wider domain compared to Churchill and Peter. Bearden et al. include scales from the psychology and organizational behavior literatures because marketing scholars in particular areas tend to use these measures.

Next, a brief review of Churchill and Peter's findings is provided. The method used to conduct the meta-analysis is explained and findings are presented. Implications for measure development are offered along with concluding comments.

BACKGROUND

Churchill and Peter (1984; designated CP hereafter) studied the effects of three general types of research design variables on the average reliability of 154 rating scale measures. Sampling characteristics, measure characteristics, and measure development procedures were examined to note whether particular characteristics and/or procedures tend to be related to higher levels of reliability (internal consistency) in rating scale measures. Each of these characteristics is discussed next. Interested readers are urged to refer to Tables 1-3 in CP for more detailed information.

Sampling Characteristics

Sampling characteristics are research design elements such as type of sample (nonprobability or probability), sample size, type of subjects (students, non-students), and method of data collection, to name a few. CP investigated these and other sampling characteristics; for some, formal hypotheses were stated. For example, CP expected college student samples to produce more reliable measures than samples using other types of respondents. The average reliability (expressed as coefficient alpha) of scales developed and tested using college students was α=0.81, which is acceptable by psychometric standards (Nunnally 1978). However, this level of reliability was not significantly higher than the reliabilities computed from scales developed and tested using non-student samples (members of organizations, α=0.78; head of household/housewife, α=0.75; combination, α=0.71; other, α=0.66). Thus, the hypothesis was not supported. In short, none of the sampling characteristic variables studied had a significant, predicted effect on a measure's reliability (the size of the validation sample had a significant negative effect on reliability but was not predicted a priori).
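Coefficient alpha, the dependent variable throughout this literature, can be computed directly from item-level responses. The sketch below shows the standard formula in plain Python; the six respondents' scores are invented for illustration, not taken from CP or BNM.

```python
def cronbach_alpha(items):
    """Coefficient alpha for a k-item scale.

    `items` is a list of k lists, each holding one item's scores
    across the same n respondents.
    """
    k = len(items)
    n = len(items[0])

    def variance(xs):
        # Sample variance (n - 1 denominator).
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Sum of the individual item variances.
    item_var = sum(variance(item) for item in items)
    # Variance of each respondent's total score across the k items.
    totals = [sum(items[i][j] for i in range(k)) for j in range(n)]
    total_var = variance(totals)
    return (k / (k - 1)) * (1 - item_var / total_var)

# Three strongly related items, six respondents (illustrative data):
scores = [
    [4, 5, 3, 2, 4, 5],
    [4, 4, 3, 2, 5, 5],
    [5, 5, 2, 2, 4, 4],
]
print(round(cronbach_alpha(scores), 2))  # 0.92
```

Because the three items covary strongly, the total-score variance dwarfs the summed item variances and alpha comes out high, in the range BNM's scales typically report.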

Measure Characteristics

Measure characteristics include number of items in the scale, number of dimensions, use of reverse scoring, and number of scale points, to name a few. CP found that two of nine measure characteristics had a significant effect on the level of reliability in a rating scale measure. First, a positive relationship between the number of items used and the reliability of the measure was proposed. This hypothesis was supported; the average number of items across 154 scales was 13.5. Since the hypothesis was examined via regression analysis, CP do not provide the average levels of reliability for measures having few compared to many items.
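The hypothesized positive relationship between number of items and reliability is what classical test theory predicts: the Spearman-Brown prophecy formula gives the expected reliability of a scale lengthened with parallel items. A minimal sketch (the numbers are illustrative, not drawn from CP or BNM):

```python
def spearman_brown(r, factor):
    """Projected reliability when a scale is lengthened by `factor`
    (factor = new length / old length), assuming parallel items."""
    return factor * r / (1 + (factor - 1) * r)

# Doubling a scale whose current reliability is 0.70:
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```

The gain is nonlinear: each additional block of items buys less improvement, which is consistent with CP's framing of the effect as holding "over a normal range" of scale lengths.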

Second, a positive relationship was proposed to exist between the number of scale points (in the items) and the reliability of a measure (over a normal range). This hypothesis was supported based on 131 studies. The average number of scale points (per item) was 5.8; but again, CP do not report a breakdown of the average a value for measures containing items with many scale points compared to those items having few.

Hypotheses regarding the use of reverse scoring, dimensionality of scale measures, item difficulty, and extent of scale point description were not supported. No relationship was proposed a priori, nor found empirically, between measure reliability and type of item construction (Likert, semantic differential, etc.) or type of labels used on items (numerical, verbal, etc.).

Measure Development Procedures

Measure development procedure variables include the source of a scale (originally developed, borrowed-modified, borrowed-unmodified), procedures used to generate items (literature review, interviews, etc.), and whether the construct domain was defined, to name a few. None of CP's hypotheses pertaining to these elements of research design were supported. For example, scales borrowed from other domains did not have higher reliability scores than those that were newly developed specific to a marketing application. No relationship was proposed nor found regarding a priori specification of a measure's dimensionality or whether a measure was specifically investigated for dimensionality.

In summary, CP concluded that it was the characteristics of the measures themselves (the actual stimuli that subjects respond to) that accounted for differences in the reliability of rating scale measures. In other words, properties of the measures themselves were more influential in explaining differences in reliability compared to sampling characteristics or measure development procedures.

METHOD

In replicating and updating the CP study, their methodology was followed as closely as possible. A meta-analysis of the scale measures in Bearden et al. (1993; designated BNM hereafter) was conducted and data were analyzed in much the same way, with a few exceptions which are discussed in the Findings. Before providing the details of the method, a brief explanation is offered regarding the meta-analytic procedure.

Meta-Analysis

Meta-analysis is the statistical summary of findings across studies. In summarizing findings across many studies, researchers can better understand sources of variation about some phenomenon. Meta-analysis has been used increasingly in marketing; in some areas a quantitative cumulation or synthesis of findings may offer greater insight compared to a traditional narrative literature review. Recent examples include Wilson and Sherrell (1993) and Brown and Stayman (1992).

To do a meta-analysis, a researcher gathers as many original studies as he/she can find on a topic. The findings of the individual studies, usually expressed in terms of statistical effect sizes or amounts of explained variance (e.g., r², ω²), become the dependent variables for the meta-analysis while other aspects of the studies (method characteristics and substantive characteristics) become independent variables. In this study, as in CP, the major dependent variable is the reported estimate of reliability. Independent variables are research design elements as discussed above. For those not familiar with meta-analysis, a few excellent sources of information are Rosenthal (1991); Hunter and Schmidt (1990); Houston, Peter, and Sawyer (1983); and Monroe and Krishnan (1983).
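The setup just described can be sketched as a table of coded records, with the reliability estimate as the dependent variable and the coded design characteristics as independent variables. The records and values below are hypothetical; only the structure mirrors the coding form described in the text.

```python
# Hypothetical coding-form records: each scale contributes one reliability
# estimate (the DV) plus coded design characteristics (the IVs).
studies = [
    {"alpha": 0.84, "sample": "students",     "items": 20},
    {"alpha": 0.78, "sample": "non-students", "items": 10},
    {"alpha": 0.81, "sample": "students",     "items": 14},
    {"alpha": 0.75, "sample": "non-students", "items": 8},
]

def mean_by(records, key, dv="alpha"):
    """Average the DV within each level of one design characteristic."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec[dv])
    return {level: sum(vals) / len(vals) for level, vals in groups.items()}

print(mean_by(studies, "sample"))
```

Comparing these group means (via ANOVA) or regressing the DV on continuous characteristics such as number of items is exactly the form of analysis reported in Tables 3-5.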

Procedure

We use BNM as our source of studies for the meta-analysis. They report scale development and testing information for 122 measures across six general areas that have been studied by marketing researchers (individual traits, values, involvement/information processing, advertising stimuli, attitudes about business, and sales/firm issues). A coding form was developed based on Tables 1-3 of CP; sampling, measure, and measure development characteristics were included so that results could be summarized and compared to those of CP. The variables included on the coding form are shown in Tables 3-5 of the present article.

Data were coded by two judges independently, following the same procedure as CP. After a set of studies was coded, the judges met to resolve any inconsistent evaluations. So, for each scale in BNM, we coded the reliability score and information regarding the sampling, measure, and measure development characteristics.

Additional information from subsequent research on measure reliability was sometimes included in BNM as supporting evidence. In these cases, an average reliability score was recorded for that particular scale. When multiple types of reliability were reported, those coefficients representing internal consistency were used first (coefficient alpha, reliability of the linear combination, composite reliability in structural equations (LISREL) applications) with other coefficients (split-half, Spearman-Brown, alternate forms, test-retest) used in the absence of internal consistency estimates. This was done to be consistent with CP.
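The coefficient-selection rule just described can be expressed as a small decision function: prefer internal-consistency estimates, fall back to other coefficients only when none are reported, and average within whichever pool is used. The type labels below are illustrative shorthand, not BNM's exact terminology.

```python
# Preference order when a study reports several reliability types.
INTERNAL = {"coefficient alpha", "linear combination", "composite (LISREL)"}
FALLBACK = {"split-half", "Spearman-Brown", "alternate forms", "test-retest"}

def pick_reliability(reported):
    """`reported` maps reliability type -> coefficient. Returns the mean of
    the internal-consistency estimates if any exist, else of the rest."""
    pool = {t: r for t, r in reported.items() if t in INTERNAL}
    if not pool:
        pool = {t: r for t, r in reported.items() if t in FALLBACK}
    return sum(pool.values()) / len(pool)

# Internal-consistency estimate wins over test-retest:
print(pick_reliability({"coefficient alpha": 0.82, "test-retest": 0.70}))
```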

Of the 122 measures presented in BNM, most could be included in the meta-analysis. Nine measures were not included since the original authors did not report quantitative estimates of reliability. In a few cases, missing information on some of the independent variables prevented a measure from being included in a particular analysis. So, the number of observations of reliability estimates generally ranges between 110-113 for the analyses and comparisons with CP. Sample size information for each statistical test is presented in Tables 3-5.

FINDINGS

Summary statistics for the reliability estimates are provided in Table 1. All reliability coefficients for the measures included in BNM were 0.60 or above. This is due to the fact that BNM's stated purpose was to compile multi-item, self-report measures developed and/or frequently used in marketing (BNM, 1993, p. 1). Many of the scales had been investigated by other researchers, in addition to the measure developers, to add further evidence as to the psychometric integrity of the particular measure. This is also reflected in the average reliability estimate (0.81) across studies.

In Table 2, the average reliability estimates are broken down into six groups according to the chapters in BNM, which represent general areas of marketing research. Measures of reactions to advertising stimuli and measures of sales/firm issues had the highest average reliability scores (0.84 and 0.83, respectively) while measures of individual traits and values had the lowest scores (0.78 and 0.77, respectively). The differences in the average reliability scores across the groups are significant.

Findings regarding reliability estimates and five sampling characteristics are shown in Table 3. We used the same type of analysis approach for each comparison (regression or ANOVA) as in CP for consistency. Comparisons of results to CP are shown in the right-most column of Table 3. In short, none of the sampling characteristics had a significant effect on the level of reliability of a rating scale measure. For example, there was no difference in reliability for measures using nonprobability samples compared to probability samples, no difference in reliability for studies using students compared to non-students, and so on. Results from BNM match those of CP except for one comparison. CP found, but did not hypothesize a priori, that measures developed using a smaller validation sample tended to be associated with higher average reliability estimates. We did not find this relationship to hold.

CP examined seven sampling characteristics while this study includes five. Insufficient information was available in BNM to do comparisons on "response rate" and "number of samples used to develop the measure."

TABLE 1

SUMMARY STATISTICS FOR THE RELIABILITY OF SCALE MEASURES ACROSS STUDIES

TABLE 2

AVERAGE RELIABILITY ESTIMATES BY CHAPTER IN BNM

The effects of measure characteristics on reliability estimates are reported in Table 4. Eight of nine measure characteristics included in CP are included in the present analysis. Insufficient information was available to include "difficulty of items."

In six of eight comparisons, results from the meta-analysis of BNM match those of CP. Inconsistent results were found for "number of items in the final scale" and "number of scale points." CP obtained significant differences in reliability estimates for these two variables; however, no significant difference was obtained in the present analysis.

Finally, the impact of measure development procedures on reliability estimates is shown in Table 5. Data from BNM are consistent with CP in two of three comparisons. Reliability estimates differ significantly depending on the original source of the scale (F=4.96, 2 and 110 d.f., p<.009; developed scales have a higher reliability score, on average, compared to borrowed-unmodified scales). However, as noted in the table, this difference must be viewed with caution due to small cell sizes.

DISCUSSION

Results of the present meta-analysis reaffirm, to a large degree, the findings of CP. In a total of 16 comparisons, four comparisons were "misses" while 12 were "hits" in terms of matching of results. In other words, CP's results are confirmed in 75 percent of the comparisons to BNM. This result is robust (χ²=26.00, 1 d.f., p<0.01) considering that almost ten years have passed and that most of the scales do not overlap with those included in CP's original study.

Implications for Research

Although our results largely mirror those of CP, what does this say about whether the reliabilities of rating scales can be affected by research design characteristics? For our results, significant differences in reliabilities were found for only one variable (source of scale). Thus, it seems that research design characteristics may have a minimal effect, at best, on levels of reliability in rating scales. To examine this issue further, we conducted an additional analysis using a subset of study characteristic variables.

The variables included in this post-hoc analysis are "nature of sample" (nonprobability, probability, other), "type of sample" (students, family members, organizational members), "type of research" (correlational, experimental), and "respondent urgency" (forced choice, neutral point, other). These variables were chosen because they are salient study characteristics which may be more likely to influence the strength of manipulations and statistical effect sizes (Bearden, Netemeyer, and Mobley 1993; Calder, Phillips, and Tybout 1981; Peterson, Albaum, and Beltramini 1985) compared to other, more cursory characteristics (e.g., use of reverse scoring, type of label, etc.).

We entered all four variables (coded as effects) into the regression equation with average reliability as the dependent variable. Adjusted R2 is .26 (F=4.13, p<.001), indicating that the average reliability of a rating scale does tend to change given different levels of these independent factors. Results of the model are shown in Table 6 and are explained next.

Scales which have a neutral point for respondents tend to be associated with higher levels of reliability compared to those that use a forced choice or some other response mechanism (B=.25, p<.05). Scales which are borrowed from other literatures and unmodified for use in marketing studies tend to be associated with lower reliabilities, as one might expect (B=-.33, p<.001). Scales used in correlational research, as opposed to experimental studies, tend to be associated with higher levels of reliability (B=.32, p<.01). This result must be viewed with some caution, however, since the number of observations is skewed: there were 108 correlational studies and only 5 experiments. The levels of the final variable, sample type, did not yield any significant differences in the average reliability of rating scales.
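Effect coding, the scheme used for these regressors, represents each categorical level as a deviation from the grand mean rather than from a reference cell. A minimal sketch of how such a design matrix is built (the level names follow the text; the three-row example is hypothetical):

```python
def effect_code(levels, reference):
    """Effect-code a categorical variable: each non-reference level gets
    an indicator column; the reference level is coded -1 in every column."""
    cols = [lvl for lvl in sorted(set(levels)) if lvl != reference]
    rows = []
    for lvl in levels:
        if lvl == reference:
            rows.append([-1] * len(cols))  # reference row: all -1s
        else:
            rows.append([1 if lvl == c else 0 for c in cols])
    return cols, rows

# "Type of research" with 'experimental' as the reference level:
cols, X = effect_code(
    ["correlational", "experimental", "correlational"], "experimental")
print(cols)  # ['correlational']
print(X)     # [[1], [-1], [1]]
```

With this coding, a coefficient such as B=.32 for correlational research is read as that level's deviation from the overall mean reliability, which is how the Table 6 results are interpreted above.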

TABLE 3

IMPACT OF SAMPLING CHARACTERISTICS ON RELIABILITY ESTIMATES

(ADAPTED FROM CHURCHILL AND PETER, 1984)

In sum, this analysis points out interesting trends. For example, in terms of increasing the reliability of a rating scale, including a neutral point in the type of response may be better than a forced choice, and use of a specially developed scale may be better than using a borrowed, unmodified scale. Calder et al. (1981) advocate the use of experiments in marketing over non-experimental formats because of enhanced control for stronger manipulations and statistical effects. The reliabilities of rating scales were not higher for the experimental studies, on average, although this may be due to the lack of a large number of observations on this comparison.

Limitations

One limitation of this analysis is the use of a "convenience sample" of scales in BNM to replicate CP's work. Although a very useful tool for researchers, the sample of observations regarding reliability of rating scales lacks much variance, and consequently we do not find substantial differences in our comparisons. This finding may be because research design characteristics really do not have a substantial impact on scale reliabilities. In addition, this finding may be due to BNM's criteria for inclusion in the Handbook: only multi-item scales which tend to be used by marketing scholars. No single-item measures nor infrequently used measures were included. Thus, BNM seem to have mostly "established" measures while CP may have had a larger proportion of less psychometrically sound measures. Indeed, they report a larger range of reliability estimates (0.26 to 0.99) whereas all of the measures in BNM had a reliability of at least 0.60. The average reliability score from measures included in BNM is higher than that from CP (0.81 compared to 0.75).

Bruner and Hensel (1992) also have a compilation of scales used in marketing. Their compilation was not included in the meta-analysis database because it would have duplicated information in BNM. In addition, BNM summarized scales which measure marketing-related "traits" while Bruner and Hensel's compilation includes scales for general constructs such as general affect and satisfaction and more specific constructs such as cooking enjoyment and sales agent contact frequency. A future replication of the present study should include information from Bruner and Hensel (1992).

TABLE 4

IMPACT OF MEASURE CHARACTERISTICS ON RELIABILITY ESTIMATES

In conclusion, the status of the reliability of rating scale measures in marketing seems to be quite good and improving, based on this comparison. Scales used to measure marketing traits are being developed with care and rigor; otherwise, average reliability coefficients would not be at the levels found based on the information in CP and BNM.

TABLE 5

IMPACT OF MEASURE DEVELOPMENT PROCEDURES ON RELIABILITY ESTIMATES

TABLE 6

MODELING RESULTS FOR A SUBSET OF INDEPENDENT VARIABLES USING EFFECT CODING

REFERENCES

Bearden, William O., Richard G. Netemeyer, and Mary F. Mobley (1993), Handbook of Marketing Scales, Newbury Park, CA: Sage Publications.

Brown, Steven P. and Douglas M. Stayman (1992), "Antecedents and Consequences of Attitude toward the Ad," Journal of Consumer Research 19 (June): 34-51.

Bruner, Gordon C. and Paul J. Hensel (1992), Marketing Scales Handbook, Chicago, IL: American Marketing Association.

Calder, Bobby J., Lynn Phillips, and Alice M. Tybout (1981), "Designing Research for Application," Journal of Consumer Research, 8 (September), 197-207.

Churchill, Gilbert A. and J. Paul Peter (1984), "Research Design Effects on the Reliability of Rating Scales: A Meta-Analysis," Journal of Marketing Research, 21 (November), 360-375.

Houston, Michael J., J. Paul Peter, and Alan G. Sawyer (1983), "The Role of Meta-Analysis in Consumer Behavior Research," In Advances in Consumer Research 10 (Richard Bagozzi and Alice Tybout, eds.), Ann Arbor, MI: Association for Consumer Research, 497-502.

Hunter, John E. and Frank L. Schmidt (1990), Meta-Analysis: Cumulating Research Findings Across Studies, Beverly Hills, CA: Sage.

Monroe, Kent B. and R. Krishnan (1983), "A Procedure for Integrating Outcomes Across Studies," In Advances in Consumer Research 10 (Richard Bagozzi and Alice Tybout, eds.), Ann Arbor, MI: Association for Consumer Research, 503-508.

Nunnally, Jum C. (1978), Psychometric Theory, New York: McGraw-Hill.

Peterson, Robert A., Gerald Albaum, and Richard Beltramini (1985), "A Meta-Analysis of Effect Sizes in Consumer Behavior Experiments," Journal of Consumer Research, 12 (June): 97-103.

Rosenthal, Robert (1991), Meta-Analytic Procedures for Social Research, Beverly Hills, CA: Sage.

Wilson, Elizabeth J. and Daniel L. Sherrell (1993), "Source Effects in Communication and Persuasion Research: A Meta-Analysis of Effect Size," Journal of the Academy of Marketing Science 21 (2): 101-112.
