Reliability, Generalizability and Consumer Behavior

J. Paul Peter, Washington University
ABSTRACT - This paper is designed to provide the consumer behavior researcher with resources for better understanding reliability theory and estimation. First, traditional conceptual and operational definitions of reliability are discussed. Second, attention is given to the reformulation of reliability in terms of generalizability theory. Lastly, the discussion centers on issues which have discouraged reliability estimation in consumer behavior research.
J. Paul Peter (1977) ,"Reliability, Generalizability and Consumer Behavior", in NA - Advances in Consumer Research Volume 04, eds. William D. Perreault, Jr., Atlanta, GA : Association for Consumer Research, Pages: 394-400.


INTRODUCTION

Consumer behavior researchers are genuinely concerned with approaching their area of inquiry in a scientific fashion. However, reliability, which is considered a necessary (but not sufficient) condition for the validity and value of research results and their interpretation, has received little emphasis in the consumer behavior literature. This paradox can be partially explained by the fact that 'reliability' has a variety of meanings and definitions, and there is considerable controversy concerning the term.

This paper discusses a variety of interpretations of reliability as well as a number of reasons why reliability has received little emphasis in consumer behavior research. Its primary purpose is to provide a resource for consumer behavior researchers interested in reliability theory and in estimating the reliability of their measures. Although a number of unresolved issues in reliability theory are not dealt with in detail, an attempt has been made to provide appropriate references for the interested reader. With this in mind, the paper is divided into three sections.

First, attention is given to traditional conceptual and operational definitions of reliability in the psychometric sense. Thus, interest centers on "psychometric reliability" which is concerned with measurement error rather than "statistical reliability" which is concerned with sampling error (see Broedling, 1974). Further, primary emphasis in this section is placed on correlational or variance approaches rather than methods of measuring agreement among judges or raters (for discussions of the latter, see Lawlis and Lu, 1972; Fleiss and Cohen, 1973).

Second, attention is given to two reformulations which attempt to clarify the problems in traditional reliability theory. These reformulations have in common the goal of providing unified conceptual and operational frameworks for handling all potential sources of variance in a measuring procedure, including measurement and sampling errors. The first approach, which is properly labeled "generalizability theory," was developed by Cronbach and his associates. The second approach is that of Cattell and his associates and is not usually referred to as generalizability theory. However, it is discussed in this section because of its communality of goals with generalizability theory.

Lastly, reliability is discussed in relation to consumer behavior research. Emphasis in this section is placed on six issues which have discouraged reliability estimation in the area.

TRADITIONAL RELIABILITY DEFINITIONS

This section is concerned with traditional conceptual and operational approaches to defining reliability. Before proceeding, it is interesting to note that three of the classic references in the psychological literature employ three separate sets of assumptions for deriving reliability theory. Guilford (1954) employs the Spearman-Yule theory of true and error factors; Gulliksen (1950) employs the Brown-Kelley theory of statistically equivalent tests; Nunnally (1967) employs the domain sampling approach of Tryon (1957). Although a comparison of these assumption structures is not necessary here and is provided elsewhere (Tryon, 1957), this could well be a source of confusion for the consumer behavior researcher attempting to understand reliability.

Conceptual Definitions

One approach to specifying the meaning for a construct is to present words or phrases as synonyms for it (Kaplan, 1964, p. 72). Numerous words have been used as synonyms for reliability, e.g., dependability, precision, accuracy, stability, consistency, predictability, repeatability, constancy. However, since the construct of reliability has only systemic meaning, these synonyms represent alternative interpretations but not definitions of reliability. These synonyms can be traced back to three approaches to defining reliability.

The first approach to a definition of reliability epitomizes the question: If we use the same measures over and over again, will we get the same results? This definition focuses on repeatability or constancy but other "synonyms" are also implied. Although recent business (Emory, 1976) and marketing (Churchill, 1976; Tull and Hawkins, 1976) research texts have taken a more sophisticated view, this is perhaps the most common approach to defining reliability in marketing.

A second approach deals with the question of whether or not measures obtained from an instrument are the "true" measures of the property measured. This definition focuses on the accuracy or precision of measures and clearly implies stability.

A third approach to the definition of reliability focuses on how much error of measurement there is in a measurement instrument. To the extent that errors of measurement are present in a measuring instrument, the instrument is unreliable. Thus, reliability can be defined as the relative absence of errors of measurement in a measuring instrument and is associated with random or chance errors (Kerlinger, 1973, p. 443). A reliable measurement in this context will be dependable, stable, consistent, predictable and accurate if its primary source of variance is systematic variance and error variance is minimal. The following explanation of reliability in this sense is based on Guilford (1954) and Kerlinger (1965, 1973) and emphasizes the Spearman-Yule theory of true and error factors. Although this approach has been justifiably criticized by Tryon (1957) because of some unobjective (and unnecessary) assumptions about the nature of correlations between true and error components, the framework, stripped of these assumptions, provides an excellent approach for understanding reliability in a traditional sense.

This approach starts with the notion that the mean and variance of any observed scale score can each be divided into two parts. In terms of the mean, the two parts are (1) the true score and (2) the error score. The true score is a perfect measure of the property but is never really known: in practice, it is considered to be the mean of a large number of administrations of the same test to the same person. The error score is an increase or decrease from the true score resulting from errors of measurement -- the source of unreliable measurement. Symbolically,

Xobserved = Xtrue +  Xerror    (1)

The variance of an obtained measure also is assumed to have two parts -- a "true" component (which may also include systematic variance other than that from the phenomenon under investigation) and an error component or symbolically,

Vobserved = Vtrue +  Verror    (2)

Although in practice, Xtrue, Xerror and Vtrue are never really known, it is possible to estimate Verror, substitute it in and solve the variance equation for Vtrue. The reliability coefficient (rtt) is the proportion of "true" variance to the total variance of a measurement instrument or

rtt = Vtrue/Vobserved  (3)

This definition exemplifies the fact that if Verror is small and Vtrue is large, the measure has high reliability and vice versa. However, as a practical matter, since Vtrue is not known, the equation is theoretical and cannot be used for computation unless Verror is first estimated. Thus, since Vtrue = Vobserved - Verror, equation (3) can be rewritten as

rtt = 1 - Verror/Vobserved    (4)

or, placing the terms over the common denominator Vobserved,

rtt = (Vobserved - Verror) / Vobserved    (5)

These two equations are both theoretical and practical. As a theoretical matter, equations (4) and (5) exemplify the notion that error or random variance reduces the reliability of measures and are consistent with the definition of reliability as the relative absence of errors of measurement.

An analysis of variance approach has been suggested (e.g., Hoyt, 1941; Alexander, 1947; Burr, 1955) for estimating reliability and the ANOVA model is perfectly consistent with the conceptual model. Basically, the analysis of variance approach employs the mean square of the residual as an estimate of Verror and the mean square of the main effect as an estimate of Vobserved, and substitutes each into equation (4) or (5) (for computational examples, see Kerlinger, 1973, pp. 447-451). However, this is an operational approach to estimating reliability, the area to which attention is now given.
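The ANOVA estimate can be made concrete with a short computation. The following sketch (the persons-by-items data matrix is hypothetical, constructed only for illustration) derives the mean squares from a two-way classification without replication and substitutes them into equation (4).

```python
import numpy as np

def hoyt_reliability(scores):
    """Hoyt's ANOVA estimate of reliability for a persons x items matrix.
    MS(persons) estimates Vobserved, MS(residual) estimates Verror, and
    equation (4) gives rtt = 1 - Verror / Vobserved."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape                                    # n persons, k items
    grand = scores.mean()
    ss_persons = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_residual = ((scores - grand) ** 2).sum() - ss_persons - ss_items
    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_persons

# hypothetical responses of five persons to a four-item scale
data = [[4, 5, 4, 5],
        [2, 3, 2, 2],
        [5, 5, 4, 5],
        [1, 2, 1, 2],
        [3, 3, 4, 3]]
print(round(hoyt_reliability(data), 3))   # -> 0.969
```

Persons with consistently high or low scores across items generate large between-person mean squares relative to the residual, and hence a high coefficient.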

Operational Definitions

Operational definitions indicate what must be done to find out the value of a conceptual variable in a given empirical instance (Runkel and McGrath, 1972, p. 150). As such, reliability has three basic forms of operational definitions. Each of these operational definitions has in common the goal of deriving sets of scores from the "same" test administered to the "same" sample for the purpose of correlation to find rtt, the reliability coefficient (Guilford, 1954, p. 373). In terms of true and error variance, the logic for the three operational definitions is the same. Basically, for a set of measures to correlate highly with "itself," the only changes which could take place would be systematic changes -- either systematically higher or lower scores. These changes do not affect the rank of objects, contribute only systematic variance and thus do not affect reliability. On the other hand, differential rates of change may affect the rank order of objects, contribute to error variance, and thus lower the correlation between tests -- the reliability estimate. The basic difference between the operational definitions is what is considered as a suitable replication of the set of measures. In test-retest, the identical set of measures is given to the same sample on two separate occasions; in alternative forms, separate sets of measures are employed which are designed to be as similar as possible; in internal consistency, subsets of items within the set of measures are correlated.

Test-Retest Reliability. This form of reliability estimation applies the same measure a second time to the same set of objects under conditions as similar as the investigator can make them. The scores from the two tests are then correlated to determine a reliability coefficient. This form of reliability estimates the stability of measures over time.

The retest method is not generally recommended for four reasons. First, different results may occur depending upon the length of time between measurement and remeasurement; the longer the time interval, the lower the reliability (Bohrnstedt, 1970, p. 85). Second, if there is a change in the phenomenon between the first and second measure, there is no way to distinguish between change and unreliability (Runkel and McGrath, 1972, p. 55; see Heise (1969) for an approach to overcoming this problem). Third, there is a problem of reactivity, i.e., the initial testing may enhance the respondents' sensitivity or responsiveness to the variable under study, thus affecting subsequent measurement. Lastly, the retest correlation is only partly dependent on the correlation between items within a test. Even if the items within each testing correlated zero on the average with one another, it would still be possible to obtain a positive correlation between scores in the two testings. This is because the numerator of the correlation of sums is the sum of all cross-correlations between the two sets of items and thus includes the correlation of each item with itself. Such correlations would be expected to be much higher than those found between different items and could produce a substantial correlation between retests (Nunnally, 1967, p. 215).
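Operationally, the test-retest estimate is simply a product-moment correlation between the two administrations. A minimal sketch, using hypothetical scores for ten respondents measured on two occasions:

```python
import numpy as np

# hypothetical scores for the same ten respondents on two occasions
time1 = np.array([12, 15, 9, 20, 14, 11, 18, 7, 16, 13])
time2 = np.array([13, 14, 10, 19, 15, 10, 17, 8, 15, 14])

# test-retest reliability: correlation of the two sets of scores
r_tt = np.corrcoef(time1, time2)[0, 1]
print(round(r_tt, 3))   # -> 0.972
```

Note that a uniform shift in all scores between occasions (systematic change) would leave this coefficient unaffected; only differential change across respondents lowers it, consistent with the logic outlined above.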

Alternative Forms Reliability. This form of reliability estimation measures the same objects by each of two instruments designed to be as similar as possible. This is similar to the retest method except that a different form of instrument is used. Sum scores from the two forms are correlated and the resulting index is interpreted as both equivalence of content and stability of performance. More precise definitions of alternative form reliability have been offered. For example, Thorndike (1951) speaks of equivalent forms which are defined as tests having identical true variance and no overlap of error variance. Gulliksen (1950) speaks of parallel tests which are defined as tests having equal means, equal variances, and equal intercorrelations with one another.

Alternative forms reliability is preferred by some researchers, particularly because of its flexibility in estimating various sources of measurement error. Comparisons between alternative forms and internal consistency estimates, or comparisons made by varying the test-retest period, are useful for isolating sources of measurement error (see Nunnally, 1967, pp. 211-213). Alternative form reliability is particularly important to estimate in cases where the phenomenon under investigation is expected to vary considerably over relatively short periods of time.

The primary limitations of alternative form reliability deal with the development of substantially equivalent alternative measures. Even more perplexing is the problem of "proving" that the two measures are equivalent. For example, if alternative forms show low reliability, there is no way of telling whether the measure has intrinsically low reliability or whether the particular alternative form has failed to be equivalent. Although it may be difficult to develop alternative forms, Nunnally (1967, p. 213) has observed that "if an alternative form cannot be constructed . . . it is doubtful that anything of importance is being measured."

Reliability as Internal Consistency. The basic form of this reliability estimation, split-halves, measures the same objects with the same instrument, splits the items in half, and compares the results between the two halves. This method is a logical extension of alternative-forms but can be applied only with measures containing multiple items or multiple trials. This method assumes that each item is essentially a replication of every other item, i.e., that the items are homogeneous or internally consistent and should thus "hang together." The items are usually split in half randomly or on an odd-even basis and correlations between the two halves estimate the internal consistency of the measure.
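An odd-even split can be sketched as follows. The data are simulated (a hypothetical eight-item scale in which each item reflects a common "trait" plus random error); the final line applies the familiar Spearman-Brown step-up, which adjusts the half-length correlation to estimate the reliability of the full-length instrument.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical 8-item scale for 50 respondents:
# each item = common "trait" score + independent random error
trait = rng.normal(size=(50, 1))
items = trait + rng.normal(scale=0.8, size=(50, 8))

odd_half = items[:, 0::2].sum(axis=1)     # items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)    # items 2, 4, 6, 8
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown step-up: estimated reliability of the full-length scale
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))
```

Because the halves here contain only half the items, the raw half-to-half correlation understates full-test reliability, which is why the step-up is customarily applied.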

Internal consistency measures can avoid the difficulty of developing alternative form measures by assuming that each item is essentially a replication of every other item on the instrument. Internal consistency estimates of reliability can also avoid the problems of reactivity and other changes between the first and second measure which affect test-retest and alternative form measures. However, internal consistency estimates of multiple item scales do not explicitly consider the reliability of the instrument in terms of its stability.

Although numerous formulae are available for computing internal consistency estimates (see Guilford, 1954, pp. 376-389), Tryon (1957) has illustrated that the majority of these formulae are simply different computational forms that yield the same correct value. The best known formula is Cronbach's (1951) Coefficient Alpha (α), of which Kuder-Richardson Formula 20 is a special case -- that of dichotomous items. Alpha is formulated as

α = [k/(k-1)] [1 - (ΣVi / Vt)]    (6)

where k is the number of parts into which the instrument is divided, Vi is the variance of part i, and Vt is the variance of the total scores.

The parts of the instrument can be as small as single items or as large as halves. In terms of correlation, α can be defined as the average correlation among a set of items or as the item-total correlation, adjusted for the number of items. Nunnally (1967, p. 196) states that α is one of the most important deductions from the theory of measurement error and that it should be routinely applied to all new tests. Further, α usually provides a good estimate of alternative-form reliability (Nunnally, 1967, p. 251). A computer program is available for calculating α in the marketing literature (Vigderhous, 1974); for a numerical example including the calculation of the standard error of measurement, see Bohrnstedt (1970, pp. 89-90).
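Computationally, α follows directly from equation (6) with single items as the parts. A sketch with a hypothetical persons-by-items matrix:

```python
import numpy as np

def coefficient_alpha(items):
    """Cronbach's alpha, equation (6): [k/(k-1)] * [1 - (sum of item
    variances / variance of the total scores)], items as the parts."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1).sum() # sum of part variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# hypothetical responses of five persons to a four-item scale
data = [[4, 5, 4, 5],
        [2, 3, 2, 2],
        [5, 5, 4, 5],
        [1, 2, 1, 2],
        [3, 3, 4, 3]]
print(round(coefficient_alpha(data), 3))   # -> 0.969
```

For any given persons-by-items matrix this value coincides with the ANOVA estimate discussed earlier, reflecting Tryon's (1957) point that the various internal consistency formulae are computational variants of the same quantity.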

More recently, several authors have presented other internal consistency estimates which are claimed to be superior to α. For example, Bentler (1972), working from a factor analytic format, first breaks down a variable Xi into two real-valued parts: Ci, the common component, and Ui, the unique component. Likewise, a composite score Y is composed of both a common (C) and unique (U) component. Defining R as the covariance matrix of the Xi, R - U² as the covariance matrix of the Ci, and U² as the diagonal covariance matrix of the Ui, Bentler (1972, p. 346) states that there are three major problems with α.

First, it is negative when Σ Cov(Xi, Xj) is negative. Second, the implied covariance matrix (R - U²), containing Σj Cov(Xi, Xj) in the diagonal, is generally not positive semidefinite; i.e., the components Ci are typically not real-valued. Third, it is difficult to develop a significance test to determine whether the population value is nonzero.

Bentler introduces coefficient theta (θ), which is defined as

θ = min(Var(C)) / Var(Y)    (7)

subject to

(R - U²) = FF'    (8)

which constrains θ to being positive semidefinite. Theta is determined by an iterative minimum variance factor analysis and the significance of theta is determined by a chi-square test.

REFORMULATING RELIABILITY AS GENERALIZABILITY

Discontent with traditional approaches to reliability (as well as validity), led one writer (Cureton, 1950) to title his work "Validity, Reliability and Baloney," and Cattell (1964, p. 1), in discussing reliability problems, has observed that "to the general practicing psychologist, mathematical psychometricians have sometimes seemed lost in their labyrinthine fastnesses from logic, from common sense and certainly from psychological perspective." One approach to clarifying the ambiguity in traditional reliability theory has been its reformulation as generalizability theory.

The Cronbach Approach

The basic tenet of generalizability theory is that "An investigator asks about the precision or reliability of a measure because he wishes to generalize from the observation in hand to some class of observations to which it belongs" (Cronbach, Rajaratnam, Gleser, 1963, p. 144). Interpreted in terms of generalizability theory, a researcher estimates the internal consistency of a scale to determine if the items employed are truly representative of the universe of potential items which measure the construct. If the internal consistency estimate is high, then the researcher has some basis for interpreting the items as generalizable to the universe of items which measure the construct.

In generalizability theory, every observation is regarded as a sample from a universe of possible observations. The subject's mean score over all potential observations is his "universe score" which is analogous to the "true score" in traditional reliability theory. Since the researcher cannot obtain all potential observations, universe scores can only be inferred. The question now becomes how well can the universe score be inferred from the observed scores. This is the primary concern of generalizability theory -- determining how representative an observed score is of a universe score.

Based on the assumption that persons and conditions are randomly and independently sampled from a population and universe, Cronbach, et al. (1963, p. 146) define the coefficient of generalizability as r²(Mp, Xpi), the squared correlation of the score Xpi for person p in condition i with the universe score Mp. The universe score Mp is defined as the mean of Xpi over all conditions in the universe. Although the coefficient of generalizability is no more than an intraclass correlation, the ratio of universe-score variance to expected observed score variance, the reformulation may be useful for clarifying ambiguity in traditional reliability discussions, as Cronbach, et al. (1963, p. 156) explain:

The reinterpretation of 'reliability' theory as a theory of generalizability removes many confusions from the application of measurement theory. The semantic problems of interpreting 'reliability,' 'true score,' and 'error' reduce to mere matters of syntax when we introduce the word 'generalizability.' To speak of the generalizability of a measure is obviously an incomplete statement until the speaker indicates what construct is being generalized to; he is forced to be explicit about what has often been implicit and therefore lost from sight. The so-called error of measurement becomes a discrepancy between the measurement and a universe score. . . . Since there are many universes to which the test might be referred, the one reported coefficient does not pretend to answer all pertinent questions about the representativeness of the score.

This final notion concerning the fact that a measure can be referred to many universes is critical for understanding even traditional reliability theory. As many writers have recognized, unwanted variances come from many sources and each definition of error changes the meaning of the 'reliability coefficient' (e.g., Gleser, Cronbach, Rajaratnam, 1965, p. 396). Thorndike (1947) and Cronbach (1947) have argued that "true score" and "error" are to be defined differently depending upon the investigator's interest; for each definition a different experimental procedure must be used to estimate reliability.

Gleser, et al. (1965) have suggested a framework for handling the problem of multiple sources of variance. First, they distinguish between a decision (D) study and a generalizability (G) study. The D study collects data for the purpose of making decisions or interpretations, while the G study collects data for the purpose of estimating the components of variance of a measuring procedure. By providing estimates of variance components, the G study is used in designing efficient procedures for the decision study. In the G study, each identifying aspect of the observations -- time, instrument, observer, etc. -- is designated as a facet. The facets, as well as the subjects, are recognized as sources of variance and, depending on the conclusion the investigator is interested in, one or more of these sources of variance contribute unwanted variance or error. The design of the G study depends on the facets and the types of variance which are of interest. Both univariate and multivariate analysis of variance procedures are used to estimate each source of variance in the design. For a complete work on generalizability theory, see Cronbach, Gleser, Nanda and Rajaratnam (1972).
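For the simplest case, a single-facet G study (persons crossed with one facet, say conditions or occasions), the variance-component logic can be sketched as follows. The data matrix and function are hypothetical illustrations rather than a reproduction of Gleser, et al.'s procedures; the expected-mean-square algebra for the random-effects two-way design is standard.

```python
import numpy as np

def g_coefficient(scores, n_prime):
    """One-facet G study sketch: persons x conditions, both randomly sampled.
    Solves the variance components from the expected mean squares, then
    returns the generalizability coefficient for a D study that would use
    n_prime conditions: var(persons) / (var(persons) + var(residual)/n')."""
    scores = np.asarray(scores, dtype=float)
    n_p, n_c = scores.shape
    grand = scores.mean()
    ss_p = n_c * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_c = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_c
    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_c - 1))
    var_res = ms_res                              # residual component
    var_p = max((ms_p - ms_res) / n_c, 0.0)       # universe-score variance
    return var_p / (var_p + var_res / n_prime)

# hypothetical scores: five persons observed under four conditions
scores = [[4, 5, 4, 5],
          [2, 3, 2, 2],
          [5, 5, 4, 5],
          [1, 2, 1, 2],
          [3, 3, 4, 3]]
print(round(g_coefficient(scores, 4), 3))   # -> 0.969
print(round(g_coefficient(scores, 8), 3))   # -> 0.984
```

The second call shows the practical payoff of a G study for designing the D study: once the variance components are in hand, the investigator can project how the coefficient improves as more conditions are sampled.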

The Cattell Approach

Thus far in this paper little attention has been given to particular sources of unwanted variance. While the Cronbach approach to generalizability emphasizes the importance of specifying the particular universe to which the investigator wishes to generalize, Cattell and his associates take a different approach to clarifying the ambiguity in traditional reliability theory. Basically, the Cattell approach emphasizes terming some of the different sources of error as separate concepts other than reliability. For example, Cattell (1964, p. 10) points out that there are at least three possible senses in which the consistency or generalizability of a test needs to be evaluated, viz.:

1. Across occasions. The agreement on scores on the same test applied to the same people on different occasion conditions. This we shall call its reliability.

2. Across tests. The agreement on the same occasion and the same people of different subsets (or commonly, single items) in the same test. This agreement among parts of a test (or battery) designed to measure some one thing we shall call its homogeneity.

3. Across people. The agreement in score meaning of the same test applied (on the same kind of occasion condition) to different sets of people we shall call its transferability (or hardiness, in the sense of a plant withstanding changes of climate).

Thus, estimates of the true reliability will be affected by a sampling of people and occasions, of the true homogeneity by sampling of items (or test elements) and people, and of the true transferability across populations by the sampling of people from various cultures and occasions (Cattell, 1964, p. 11).

Reliability as a subconcept within consistency is further divided into several subtypes. "In practical evaluation it is extremely important to recognize their differences and to refer to them correctly by specific terms" (Cattell, 1964, p. 12). Although there are at least six different reliability coefficients (see Cattell, 1957), the three primary coefficients deal with (a) that which estimates unreliability due to having different test administrators; (b) that evaluating unreliability due to different scorers; and (c) that expressing the unreliability due to the remaining, unknown and uncontrollable conditions of the subjects' decision in the test and its situation (Cattell, 1964, pp. 12-13). Further, each of these error terms could be combined in seven different ways, each of which would be defining a distinct type of reliability coefficient. However, the three most important of these Cattell (1964, p. 13) calls dependability (immediate retest), administrative reliability (across administrators) and conspection coefficients (across scorers). Measures for each are contained in Cattell (1964).

In the Cattell framework, homogeneity deals with agreement among items or batteries within a test. As such, internal consistency estimates are interpreted as homogeneity and not reliability, and Cattell (1964, p. 10) argues that

In the first place we pay a heavy price in terms of testing efficiency, for mistaking homogeneity for reliability. Indeed, it is frequently indifferently designated reliability. A high reliability is almost always desirable, but homogeneity should be low or high depending on purpose and test structure.

One problem with high homogeneity is that it could well impair transferability since a high homogeneity coefficient is likely to result from a narrow, specific scale. Such scale items could well not be interpreted identically across subcultures (Cattell and Tsujioka, 1964, p. 7) or other heterogeneous groups and thus decrease the transferability of the scale. This problem is discussed in Runkel and McGrath (1972, pp. 170-172) in terms of the tradeoff between standardization versus generalizability across people. For further discussion of transferability, see Cattell and Warburton (1964).

A more important problem is that high homogeneity may be achieved at the expense of validity when dealing with a factorially complex construct (Cattell and Tsujioka, 1964, p. 14). This is because items with high intercorrelation (producing high homogeneity or internal consistency) are highly redundant. Following the logic of multiple correlation, variables with high intercorrelation predict a criterion poorly because they duplicate each other in prediction. Thus, items with low intercorrelation and high item-criterion correlation are more apt to be valid. In terms of a validity coefficient, which can be defined as the ratio of the average item-criterion correlation to the average item-total correlation, increasing homogeneity increases the denominator and thus reduces the overall validity coefficient (Guilford, 1954, p. 361). In discussing this problem, Nunnally (1967, pp. 245-250) argues that this does not mean that tests should be constructed without regard to internal consistency. He argues instead that a battery of internally consistent tests should be used to predict a factorially complex construct. However, it is pointed out that in some cases, items should be selected on the basis of item-criterion correlation, for example, selecting items in terms of their correlation with a known factor of human ability or personality (Nunnally, 1967, p. 250).

RELIABILITY AND CONSUMER BEHAVIOR

Although the confusion in reliability theory is one possible explanation for the absence of serious consideration of reliability in consumer behavior, there are several other issues involved. These issues deal with the nature of consumer behavior researchers, research, and phenomena.

Training of Consumer Behavior Researchers

Most consumer behavior researchers are trained in marketing not in psychology or psychometrics. In fact, less than one percent of the members of the Association for Consumer Research are identified as psychologists (ACR News Letter, March, 1976, p. 1). Thus, although many consumer behavior researchers are also well-grounded in multivariate statistics, few have backgrounds in measurement per se. This perhaps accounts for the fact that measurement issues and instrument development in general have received relatively little attention in the area.

Training in and use of multivariate statistics has become commonplace in consumer behavior and in view of the complexity of consumer behavior phenomena, rightly so. However, the use of multivariate statistics has seemingly engendered a tradeoff in terms of study design which has not encouraged multi-item instrument development and subsequent reliability estimation. This tradeoff is between (1) using single-item measures of many variables and applying a multivariate analysis versus (2) using multi-item measures, summing scores, and applying a univariate analysis. The elegance and apparent precision of multivariate techniques may have led consumer behavior researchers toward the former rather than the latter strategy. The development of multi-item scales for many variables and applying a multivariate analysis would perhaps be a better but more difficult and time-consuming approach.

Payoff for Reporting Reliability Estimates

In examining the question of why consumer behavior researchers generally have devoted little attention to reliability, it would be remiss not to entertain the question of payoff. Reporting high reliability estimates undoubtedly enhances the perceived quality of the researcher's work. On the other hand, a quite difficult problem arises if the researcher estimates the reliability of his measures and finds them below "acceptable" standards. The question becomes whether to report low estimates and risk non-publication or non-acceptance of the study or not report estimates and allow readers to assume acceptable reliability. It is clearly much easier to avoid this problem altogether by simply ignoring the reliability issue, i.e., accept "face reliability" as an appropriate standard.

Absolute Interpretation of Reliability Guidelines

Although there are no hard and fast rules as to what is an acceptable level of reliability, many consumer behavior researchers may be familiar with Nunnally's (1967, p. 226) guidelines. These guidelines suggest that in early stages of research on hypothesized measures of a construct, modest reliability in the area of .50 to .60 suffices; for more advanced basic research, it is argued that increasing reliabilities beyond .80 is often wasteful; in applied settings, a reliability of .90 is the minimum that should be tolerated and a reliability of .95 should be considered the desirable standard. However, an important qualification deals with viewing reliability (or other parameters by which tests or instruments are evaluated) relative to the difficulties and opportunities in a particular field. As Cattell and Tsujioka (1964, p. 23) have observed:

An examiner is unalert to realities if, as some do, he judges a test in a new, promising, but little explored field, e.g., objective motivation measurement, by the same standards as in an old field, e.g., ability and achievement. The mechanical application of the above formula will then lead to the unsophisticated reviewer telling his still less sophisticated audience that a given, say, motivation test is a poor job when in fact it is the best motivation measure published, and a very substantial advance in view of the available material.

Thus, as a relatively new area of inquiry, consumer behavior faces considerable difficulties and should perhaps view reliability coefficients relative to specific areas of inquiry rather than depending on absolute guidelines. Finally, it is interesting to note that although the A.P.A. Standards for Educational and Psychological Tests (1974) emphatically states that researchers should fully explain reliability coefficients and the sources of error they purport to account for, no numerical guidelines are listed.

Exploratory Nature of Consumer Behavior Research

Since consumer behavior is a relatively new area of inquiry, much of its research is of an exploratory nature. Interest centers on substantive issues and on finding relationships between variables, not on measurement issues. Although from a normative research standpoint, reliable (and valid) measurement should always precede the investigation of phenomena, research in any area of inquiry seldom proceeds in such an orderly fashion, as Phillips (1971, p. 203) points out:

. . . research actually seems to proceed in a back-and-forth fashion rather than in stages. Imperfect and even very rough measurements may not be easily improved until they are actually utilized in the testing of hypotheses. It may often be preferable to proceed with rough measures, because the wait for measures may be a long one -- especially if the insights derived from actual research with the measures is necessary for their improvement.

Specificity of Consumer Behavior Phenomena

Consumer behavior phenomena such as products, services, or brands are highly specific in nature. Thus, measures developed to study specific products, services, or brands are not necessarily useful for studying other products, services, or brands. Although Nunnally (1967, p. 249) has pointed out that "it is very wasteful of time and money to construct tests for each new prediction problem," consumer behavior phenomena often require a new set of measures for each phenomenon investigated. For example, the attribute measures used in the study of brands of toothpaste would not be useful for studying brands of automobiles. This specificity does not encourage painstaking multi-item instrument development and reliability estimation when the measures may only be applicable to a few demand-specific studies. Further, this specificity constrains the number of items from which the researcher can sample.

Relative Importance of Reliability

Although reliability is easier to estimate, many researchers consider it less important than validity. For example, Ligon (1975) emphasizes that validity considerations should precede reliability considerations. However, the exact relationship between reliability and validity is not clear. The point of overlap appears to be internal consistency estimates, which are interpreted by some researchers as reliability estimates and by others as validity estimates (Runkel and McGrath, 1972, p. 156).

The close relationship between the two concepts can best be demonstrated in terms of correlations. If reliability is defined as the squared correlation between an observed score and its true score, and validity is defined as the correlation between the observed score and some outside criterion, then reliability sets the upper limit on validity. More precisely, the correlation of a scale with some criterion can never exceed the square root of the reliability of the scale (see Lord and Novick, 1968, p. 72). For example, if one has a measure with a reliability of .64, that measure will never correlate greater than .80 with another variable or criterion (Bohrnstedt, 1970, p. 97). Thus, reliability and validity go hand in hand rather than one being more important than the other.
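This ceiling can be demonstrated with a short simulation, a minimal sketch in Python with NumPy (the language and the simulated scores are illustrative assumptions, not part of the original discussion); the population reliability is set to .64 so the figures match the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated true scores; the criterion is taken to be the true score itself,
# so measurement error is the only thing attenuating the validity coefficient.
true_score = rng.normal(size=n)
error = rng.normal(scale=0.75, size=n)   # error variance = .5625
observed = true_score + error            # population reliability = 1/1.5625 = .64

# Reliability: true-score variance as a proportion of observed-score variance
reliability = np.var(true_score) / np.var(observed)

# Validity: correlation of the observed score with the outside criterion
validity = np.corrcoef(observed, true_score)[0, 1]

print(round(reliability, 2), round(validity, 2))  # close to .64 and .80
```

Because the criterion here is perfectly correlated with the true score, the observed validity sits essentially at its ceiling of the square root of .64, or .80; any real criterion would correlate lower still.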

SUMMARY AND RECOMMENDATIONS

This paper has been concerned with three aspects of the confusing and often confused term "reliability." First, reliability was viewed in terms of traditional approaches and definitions. Second, reliability was viewed in terms of the broader scope of generalizability theory. Lastly, reliability was discussed in relation to consumer behavior and the problems inherent in the area. Given these problems, three recommendations are offered for improving the quality of consumer behavior research.

First, since reliability is easier to measure than validity, reliability estimation provides a useful starting point for developing higher quality consumer behavior research. This is not meant to imply that every study must include reliability estimates, but greater emphasis should be placed on estimating reliability and less on the absolute value of the reliability coefficient.

Second, in reporting reliability coefficients, considerable care should be taken to fully explain (1) the procedure used, (2) the source(s) of measurement error which are dealt with, (3) appropriate references, and (4) the interpreted meaning of the reliability coefficient. This may help avoid the problems of ambiguity in the area. For example, based on their background and training, different researchers may refer to internal consistency estimates as reliability, as validity, as homogeneity, as generalizability, or simply as internal consistency estimates.

Lastly, in spite of the definitional problems surrounding internal consistency, this method of reliability estimation appears to offer considerable advantages over test-retest and alternative-form reliability for initial reliability estimation. Although this method necessitates the development of multi-item scales, the additional effort should clearly increase the quality of both consumer behavior research and theory.
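The most widely used internal consistency estimate, Cronbach's (1951) coefficient alpha, can be computed directly from a respondents-by-items score matrix. The following is a minimal sketch in Python; the four-item scale and the six respondents' scores are hypothetical, invented purely for illustration:

```python
import numpy as np

def coefficient_alpha(items: np.ndarray) -> float:
    """Cronbach's (1951) coefficient alpha for a (respondents x items) matrix."""
    k = items.shape[1]                              # number of items in the scale
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of six subjects to a four-item attitude scale
scores = np.array([
    [5, 4, 5, 4],
    [4, 4, 3, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
])

print(round(coefficient_alpha(scores), 2))  # 0.96
```

Alpha rises with the number of items and with the average inter-item correlation, which is why this approach presupposes the multi-item scales discussed above.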

REFERENCES

H. W. Alexander, "The Estimation of Reliability When Several Trials are Available," Psychometrika, 12 (June, 1947), 79-99.

American Psychological Association, Standards for Educational and Psychological Tests (Washington, D.C.: APA, Inc., 1974).

P. M. Bentler, "A Lower-Bound Method for the Dimension-Free Measurement of Internal Consistency," Social Science Research, 1 (December, 1972), 343-357.

G. W. Bohrnstedt, "Reliability and Validity Assessment in Attitude Measurement," in G. F. Summers (Ed.), Attitude Measurement (Chicago: Rand McNally, 1970), 81-99.

L. A. Broedling, "On More Reliably Employing the Concept of 'Reliability,'" Public Opinion Quarterly, 38 (Fall, 1974), 372-378.

C. Burt, "Test Reliability Estimated by Analysis of Variance," British Journal of Statistical Psychology, 8 (November, 1955), 103-118.

R. B. Cattell, Personality and Motivation Structure and Measurement (New York: Harcourt, Brace & World, 1957).

R. B. Cattell, "Validity and Reliability: A Proposed More Basic Set of Concepts," Journal of Educational Psychology, 55 (February, 1964), 1-22.

R. B. Cattell, and B. Tsujioka, "The Importance of Factor-Trueness and Validity, Versus Homogeneity and Orthogonality, in Test Scales," Educational and Psychological Measurement, 24 (Spring, 1964), 3-30.

R. B. Cattell, and F. W. Warburton, Principles of Objective Personality Testing and a Compendium of Tests (Urbana, Ill.: University of Illinois Press, 1963).

G. A. Churchill, Marketing Research (Hinsdale, Ill.: The Dryden Press, 1976).

A. J. Conger, "Estimating Profile Reliability and Maximally Reliable Composites," Multivariate Behavioral Research, 9 (January, 1974), 85-104.

L. J. Cronbach, "Test Reliability: Its Meaning and Determination," Psychometrika, 12 (March, 1947), 1-16.

L. J. Cronbach, "Coefficient Alpha and the Internal Structure of Tests," Psychometrika, 16 (September, 1951), 297-334.

L. J. Cronbach, N. Rajaratnam, and G. C. Gleser, "Theory of Generalizability: A Liberalization of Reliability Theory," British Journal of Statistical Psychology, 16 (November, 1963), 137-163.

L. J. Cronbach, G. C. Gleser, H. Nanda, and N. Rajaratnam, The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles (New York: John Wiley & Sons, 1972).

E. E. Cureton, "Validity, Reliability and Baloney," Educational and Psychological Measurement, 10 (Spring, 1950), 94-96.

C. W. Emory, Business Research Methods (Homewood, Ill.: Richard D. Irwin, Inc., 1976).

J. L. Fleiss, and J. Cohen, "The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability," Educational and Psychological Measurement, 33 (Autumn, 1973), 613-619.

G. C. Gleser, L. J. Cronbach, and N. Rajaratnam, "Generalizability of Scores Influenced by Multiple Sources of Variance," Psychometrika, 30 (December, 1965), 395-418.

J. P. Guilford, Psychometric Methods (New York: McGraw-Hill Book Company, 1954).

H. Gulliksen, Theory of Mental Tests (New York: John Wiley & Sons, 1950).

D. R. Heise, "Separating Reliability and Stability in Test-Retest Correlations," American Sociological Review, 34 (February, 1969), 93-101.

C. Hoyt, "Test Reliability Estimated by Analysis of Variance," Psychometrika, 6 (June, 1941), 153-160.

D. N. Jackson, and M. E. Morf, "An Empirical Investigation of Factor Reliability," Multivariate Behavioral Research, 8 (October, 1973), 439-459.

A. Kaplan, The Conduct of Inquiry (Scranton: Chandler Publishing Co., 1964).

F. N. Kerlinger, Foundations of Behavioral Research (New York: Holt, Rinehart and Winston, 1965).

F. N. Kerlinger, Foundations of Behavioral Research, 3rd Ed., (New York: Holt, Rinehart and Winston, 1973).

M. Koslowsky, and H. Bailit, "A Measure of Reliability Using Qualitative Data," Educational and Psychological Measurement, 35 (Winter, 1975), 843-846.

G. F. Kuder, and M. W. Richardson, "The Theory of the Estimation of Test Reliability," Psychometrika, 2 (September, 1937), 151-160.

G. F. Lawlis, and E. Lu, "Judgment of Counseling Process: Reliability, Agreement, and Error," Psychological Bulletin, 78 (July, 1972), 17-20.

E. M. Ligon, "From Reliability to Validity: The Saga of a Psychologist," Character Potential, 7 (1975), 103-106.

E. F. Lindquist, Design and Analysis of Experiments in Psychology and Education (Boston: Houghton Mifflin, 1953).

F. M. Lord, and M. R. Novick, Statistical Theories of Mental Test Scores (Reading: Addison-Wesley, 1968).

D. M. Medley, and H. E. Mitzel, "Measuring Classroom Behavior by Systematic Observation," in N. L. Gage (Ed.), Handbook of Research on Teaching (Chicago: Rand-McNally, 1963), 247-328.

J. Nunnally, Psychometric Theory (New York: McGraw-Hill Book Co., 1967).

B. S. Phillips, Social Research: Strategy and Tactics, 2nd Ed. (New York: The Macmillan Company, 1971).

P. J. Runkel, and J. E. McGrath, Research on Human Behavior (New York: Holt, Rinehart and Winston, 1972).

R. L. Thorndike, Research Problems and Techniques, Report No. 3, AAF Aviation Psychological Program Research Reports. U. S. Government Printing Office, 1947.

R. L. Thorndike, "Reliability," in E. F. Lindquist (Ed.), Educational Measurement (Washington D.C.: American Council on Education, 1951).

R. C. Tryon, "Reliability and Behavior Domain Validity: Reformulation and Historical Critique," Psychological Bulletin, 54 (May, 1957), 229-249.

D. S. Tull, and D. I. Hawkins, Marketing Research (New York: Macmillan Publishing Co., 1976).

G. Vigderhous, "Coefficient of Reliability Alpha," Journal of Marketing Research, 11 (May, 1974), 194.
