Developing Better Measures of Consumer Satisfaction: Some Preliminary Results

Robert A. Westbrook, University of Arizona
Richard L. Oliver, Washington University
ABSTRACT - Increased empirical study of consumer satisfaction has not yet been accompanied by commensurate refinements in the conceptualization and measurement of the construct. A review of the related literatures in job, health (patient), marital, and life satisfaction reveals that varying degrees of methodological rigor have been used to develop satisfaction measures, resulting in instruments as long as 80 items, but that little consensus on conceptual or measurement issues is evident. In an attempt to further satisfaction measurement in the consumer domain, a number of suggested multi-disciplinary approaches were combined in a two-product satisfaction instrument to test for response characteristics, reliability, and convergent and discriminant validity. Results in two separate pilot samples indicate that the highest reliabilities were achieved with semantic differential and Likert scales. These measures along with other verbal and graphic rating scales also demonstrated a high level of convergence and acceptable levels of discriminability.
While interest in consumer satisfaction has grown rapidly in recent years (Hunt 1977, Day 1977, Day and Hunt 1979), there is little agreement on measurement of the construct. Empirical studies reflect a variety of different measures, although typically neither the reliability nor validity of these measures has been demonstrated. Not only does this hinder the interpretation and synthesis of research findings, but it also raises the possibility that applied studies in industry and government may be obtaining inaccurate assessments of prevailing satisfaction levels. Further attention to the development and evaluation of consumer satisfaction measures is clearly warranted. Accordingly, this paper reviews alternative satisfaction measures and presents preliminary findings of a measure validation study.

Conceptualization of Satisfaction

Prior to consideration of alternative measures, attention must be given to the conceptualization of consumer satisfaction and its relation to other theoretical constructs. While satisfaction is most readily associated with the purchase and consumption of specific products or services, it may also be relevant for shopping and patronage at retail outlets, for media usage, and even overall participation in the marketplace (Czepiel, et al. 1975). Thus, consumer satisfaction refers to an evaluative response concerning the perceived outcomes of experiences in the consumer domain, comprising acquisition, consumption, and disposition activity. These outcomes are often assumed to be evaluated according to the extent to which they fulfill consumers' expectations (Howard and Sheth 1969). However, other kinds of evaluative standards are conceivable, such as correspondence of perceived outcomes to those ideally desired or to the minimum outcomes considered acceptable. The more favorable the evaluation of perceived outcomes, the greater the satisfaction.

Central to the construct of satisfaction is the presence of affect. In connection with their evaluations of outcomes, consumers may experience varying degrees of feeling or emotion. Favorably evaluated outcomes are associated with happy, pleasant feelings, and unfavorably evaluated outcomes with unhappiness, irritation, or regret. In addition, the notion of satisfaction implies some degree of conation in that the consumer is more or less inclined to repeat the behavior in question given recurrence of the situation in which it was initially performed.

With respect to theoretical conceptions of brand purchase behavior (Howard and Sheth 1969), satisfaction with a particular product occurs after purchase commitment but prior to any revision in evoked set, brand attitude, and brand (re)purchase intentions. A high level of satisfaction is believed to increase the likelihood that the brand in question will be included in the user's evoked set, increase the favorability of brand attitude, and increase the degree of intention to (re)purchase that brand (Oliver 1980). Low levels of satisfaction, or dissatisfaction, presumably have opposite effects and tend to motivate the consumer to seek redress or remedy by complaining to the seller or third parties.

Measurement of Consumer Satisfaction

Most research on consumer satisfaction has been concerned with specific products and services (Day and Ash 1979, Swan and Combs 1976, Andreasen and Best 1977, Oliver 1980) or retail outlets (Miller 1976). Satisfaction with other facets of consumer experience has been studied only infrequently (see Lundstrom and Lamont 1976, Westbrook and Newman 1978). Typically, measurements of satisfaction with products/services and retailers are based on direct subjective estimation by consumers of the intensity or frequency of overall satisfaction experienced. Most often, simple, single-item rating scales are employed. There has been little uniformity in the number of scale steps used or nature of verbal anchoring, however; they range from 3-point fully-labeled rating scales to 10- and 11-point variants labeled only at the extremes and midpoint. Comprehensive measure comparisons have seldom been undertaken, and investigators rarely report the reliability, much less the validity, of their measures.

Multi-item rating scale measures of product/service/retailer satisfaction have found application infrequently, despite their potential to reduce measurement error. In one of the few instances, Oliver (1980) used a set of six Likert-format items dealing with overall satisfaction, for which a high level of internal consistency was observed (alpha = .82). Multi-item measures based on satisfaction ratings for individual product/service/retailer attributes have been conspicuously avoided (for an exception, see Harris 1977), most likely because of uncertainty as to the functional form in which the latter should be combined into overall satisfaction judgments.

In view of the complexity of the construct of consumer satisfaction, the predominant approach to measurement reflected in the literature may be naive. It is doubtful that the cognitive-evaluative, affective, and conative elements of satisfaction can be adequately captured in a single 5- or 7-point "very satisfied -- very dissatisfied" rating scale. If anything, such simplistic measures may be biased because of too few scale increments and the absence of explicit evaluative and affective anchoring along the scale continuum. Andreasen (1977) reports some evidence that a &-point rating scale overreported satisfaction compared to judgments based on various open-ended questions. Moreover, satisfaction studies have typically resulted in "bunching" of respondents at the upper end of the satisfaction continuum.

More comprehensive measures of consumer satisfaction are clearly needed. In searching for such measures, it is instructive to examine methods of measurement employed in studies of satisfaction with other domains beyond consumption. Accordingly, the following section presents a brief review of the major measures of job satisfaction, life satisfaction, marital satisfaction, and patient satisfaction.


Job Satisfaction

Of the other disciplines which have studied satisfaction, none has a longer tradition than management/industrial relations. Job satisfaction has been studied intensively for some fifty years, and a substantial literature has accumulated (for a recent review, see Locke 1974). The predominant measurement strategy involves direct verbal self-reports on rating scales of various forms. Often these are simply single- or dual-item overall satisfaction ratings, but a number of complex instruments examining the various facets of satisfaction have also emerged. In a review of the simpler measures, Robinson et al. (1969) find most promising two rating scales, neither of which involves an estimation of satisfaction per se. One is the self-anchoring Ladder Scale, which depicts a 9-step ladder whose bottom rung is labeled "worst job I could expect to have," and whose top rung is denoted "best job I could expect to have." Respondents select the rung that best describes their feelings about their job. The second measure asks "if you had the chance to start your working life over again, would you choose the same kind of work as you are doing now?" Responses may be made along a subjective likelihood continuum, or simply recorded verbatim for later categorization by coders along a satisfaction continuum. While these simple items may suffice to identify relative differences in satisfaction, accurate assessment of the absolute level of satisfaction is generally recognized as requiring complex instruments such as the Job Descriptive Index (Smith, Kendall and Hulin 1969). This instrument in particular has been impressively developed and validated by its authors.

Other measurement strategies in job satisfaction involve (1) inferring satisfaction from measurements of its presumed causes, (2) semi-structured interviews, and (3) recall of critical incidents. The inferential approach (Porter 1962) measures satisfaction as the inverse of the sum of discrepancies between how much of each job aspect an employee feels he is getting and how much he thinks he should be getting or how much he ideally wants. The semi-structured interview approach, while less efficient and objective than other approaches, has much to commend it; Locke (1974) encourages its wider use in satisfaction assessment. Finally, the critical incident recall method developed by Herzberg et al. (1959) is less useful as a measure of how much satisfaction was experienced than as an indicator of the sources of those feelings.

Marital Satisfaction

Marital satisfaction has received considerable attention from home economists and family sociologists. While simple "very satisfied" to "very dissatisfied" self-report rating scales have been used, there is widespread use of multi-item instruments of varying degrees of sophistication. Perhaps best known are the Blood and Wolfe (1960) Marital Satisfaction Index and the Locke and Wallace (1959) Marital Adjustment Test. The former is a four-item scale involving evaluations of the major domains of married life using a fully anchored, five-point affective response scale, ranging from "pretty disappointed" to "enthusiastic, couldn't be any better." The majority of measures of marital satisfaction involve self-reports, though some are relatively indirect. A review of these measures (e.g., see Straus 1970) reveals considerable variability in the method of measure construction, including a variety of Likert-type summated rating scales, Guttman scales, and judges' ratings of responses to unstructured projective questions. Noteworthy in marital satisfaction measurement is the common practice of obtaining an individual's evaluation of and affective responses to various component areas of marriage, e.g., spousal understanding, love and affection, companionship, etc., and combining these into an overall measure. Also evident is the attention given to considerations of measure reliability and validity (e.g., Rollins and Cannon 1974).

Patient Satisfaction

Research on patient satisfaction with health care delivery has been recently reviewed by Swan and Carroll (1979). Many investigators in this area have developed their own ad hoc measures, which are often simplistic "satisfied--dissatisfied" self-report single-item rating scales dealing either with the overall health care received or with a few selected aspects such as physician attitude, professional competence, convenience, etc. Noteworthy, however, are several major efforts to develop measures based on standard psychometric methods of attitude scale construction. Hulka et al. (1970) used Thurstone's Method of Equal Appearing Intervals for scaling attitudes toward physicians, but in a follow-up study (Zyzanski et al. 1974), the researchers revised the scale to a summated Likert-type format to obtain improved internal consistency. The content of the 42-item scale suggests that it measures generalized satisfaction with the overall domain of primary health care rather than evaluation of a specific experience. In a similar effort, Ware and Snyder (1975) developed an 80-item Likert summated scale which has found application in several other patient satisfaction studies. One of the 4-item subscales of this measure is termed "general satisfaction" and is reported to have an alpha internal consistency of .77. Again, however, this measure is a highly generalized satisfaction indicator. Finally, Mangelsdorff (1979) developed a 19-item scale for measuring patient satisfaction with a specific health care service received. Individual satisfaction ratings for various aspects of the service are made on a five-point scale and cumulated into an overall index score. The scale has demonstrated high internal consistency and some degree of validity.

Life Satisfaction

One recent sociological study of the perceived quality of life has resulted in perhaps the most sophisticated satisfaction measure development procedures to date. Andrews and Withey (1976) identified a large number of alternative satisfaction measures, which they broadly categorized according to the perspective of evaluation (absolute vs. relative, long range vs. short range), generality (general vs. specific focus), and range (full-range of experiences vs. part-range). In the category of absolute, general, full-range measures, which are of most direct relevance to this review, a number of distinct rating scale measures were examined, including a 7-point "completely satisfied--completely dissatisfied" item, a 7-point fully anchored "Delighted-Terrible" item, and a variety of graphic or nonverbal items. The latter included the self-anchoring Ladder Scale previously noted, along with the Faces Scale (Kunin 1955), the Thermometer Scale (warm to cold feelings), and the Circles Scale, which comprises nine circles each containing some proportion of pluses and minuses to indicate the incidence of favorable and unfavorable evaluations. In a comprehensive measure evaluation and validation effort, Andrews and Withey concluded that the Delighted-Terrible Scale was the most useful, yielding high construct validity (estimated at .8), reasonably symmetrical response distributions, and ease of use. The Circles and Faces scales were ranked second on these criteria, followed by the Ladder and Simple Satisfaction scales. The authors suggested that selected combinations of these single-item measures might profitably be employed as indexes yielding even higher validities.

Andrews and Withey also evaluated two other types of measures of interest: a Social Comparison rating, in which respondents assessed their satisfaction with life vis-a-vis other persons they knew, and a Peer Rating, in which three other persons' ratings of an individual's life satisfaction were averaged. Interestingly, neither of these measures attained appreciable validity coefficients, and neither was recommended for life satisfaction assessment.

Implications for Consumer Satisfaction

The satisfaction literatures in each of the above disciplines reflect three principal common elements: (1) the development of a diversity of methods for measuring the construct; (2) widespread use of multiple-item scales or index measures; and (3) consistent attention to issues of measure evaluation and validation. Consumer satisfaction research would be well advised to adopt similar measurement traditions. Many of the specific approaches to measuring satisfaction with job, spouse, health care, or life enumerated above also appear potentially suitable for application to product satisfaction assessment.

Job satisfaction measurement suggests the usefulness of two particular kinds of single-item rating scales: the self-anchoring Ladder Scale and the graphic Faces Scale. In addition, the value of a "behavioral tendency" item, in which the respondent's predisposition to repeat his previous behavior is assessed, seems apparent. Less direct self-report measures, particularly the inferential measures that assess the extent of disparity between desired outcomes and actual outcomes, also appear to have application. Finally, Locke's (1974) recommendation that open-ended interview data supplement the ubiquitous rating scale deserves further consideration by consumer satisfaction researchers. In fact, Miller (1976) has previously advocated the use of free response formats.

Though not unique to marital and patient satisfaction research, measurement in these areas is often characterized by multi-item scales created by summing evaluations dealing with particular aspects of the phenomenon. This approach might be fruitfully applied in the assessment of product/service satisfaction, provided that agreement can be reached on the basic aspects or outcomes involved in consumption. A start in this direction is the distinction between expressive and instrumental product outcomes (Swan and Combs 1976).

Life satisfaction research, in addition to providing an excellent model for measure validation studies, suggests the value of the Delighted-Terrible scale for a wide range of satisfaction assessment applications. Also promising is the Circles graphic scale. Both of these rating measures were shown to be superior to the simple "satisfied--dissatisfied" overall rating scale items typical of consumer satisfaction research. Life satisfaction studies also suggest the importance of temporal perspective for consumer research, which is rarely considered in the assessment process.


To appraise the suitability of selected measures from other disciplines in the assessment of consumer satisfaction with specific products and services, the authors undertook a pilot study whose preliminary results are presented below. This effort is part of a broader investigation currently in progress to develop and evaluate improved indicators of satisfaction with products and services.


Five types of measures were considered in this study, all rating scale variants. The first is a three-item verbal scale (hereafter VERBAL) combining separate overall assessments of product satisfaction using the Delighted-Terrible 7-point rating scale, a "Completely Satisfied--Not at all Satisfied" 11-point rating scale ranging from 100% to 0%, and a behavioral tendency 11-point rating scale ranging from "Certain I'd do it again" to "No chance I'd do it again." The specific items constituting the overall scale, which can be found in Andrews and Withey (1976, Appendix A), were combined additively to produce VERBAL. These items and all others discussed subsequently are available on request from the authors.

In contrast to the verbal orientation of the first measure, the second involved four distinct graphic rating scales: Faces, Thermometer, Circles, and Ladder. These particular items represent a desirable mixture of nonverbal content. The Faces and Thermometer scales are chiefly affective, while the Circles and Ladder scales are more cognitive-evaluative in tone. They were combined additively to yield the overall graphic scale (hereafter GRAPHIC). As before, these items can be found in the Andrews and Withey Appendix and are available from the authors.

The third measure was a Likert summated scale (hereafter LIKERT) in which 12 statements indicating varying sentiments of overall satisfaction with the product were presented to respondents for their agreement. Responses were made on a five-interval "strongly agree--strongly disagree" continuum.
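The scoring of a summated scale of this kind can be sketched in Python as follows. This is an illustration only: the 1-to-5 coding, the example responses, and the choice of which statements are negatively worded (and therefore reverse-scored) are assumptions, since the individual statements are not reproduced here.

```python
def likert_score(responses, reverse_items=(), points=5):
    """Sum agreement ratings (1..points) after reverse-scoring negative items."""
    total = 0
    for index, rating in enumerate(responses):
        if not 1 <= rating <= points:
            raise ValueError(f"rating {rating} outside 1..{points}")
        # A negatively worded statement is flipped so that a higher
        # score always indicates greater satisfaction.
        total += (points + 1 - rating) if index in reverse_items else rating
    return total


# Hypothetical respondent: 12 statements; items at positions 2 and 7
# (0-based) are assumed to be negatively worded.
example = [5, 4, 1, 4, 5, 3, 4, 2, 5, 4, 4, 5]
score = likert_score(example, reverse_items={2, 7})
```

With 12 items on a five-interval continuum, scores under this coding range from 12 to 60.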

The fourth measure consisted of a set of seven semantic differential items, again dealing with various means of expressing overall satisfaction judgments. To reduce the cognitive strain on respondents only five intervals were used on the semantic differential instead of the conventional seven. These items were combined into a simple additive index (hereafter S-D) after scoring responses as to their favorability.

The final measure was intended as an inferential satisfaction instrument. For each of a variety of product attributes, it assessed (a) the level currently provided by the product and (b) the level ideally desired by the consumer. The difference between these two ratings was computed for each attribute, and the differences were summed over all attributes. This "disparity" figure was presumed inversely related to the level of satisfaction experienced. This scale was termed the PORTER scale.
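A minimal sketch of such a disparity score is given below. The attribute names and ratings are hypothetical, and the use of a signed difference (ideal minus actual) rather than an absolute difference is an assumption; the description above does not settle that point.

```python
def disparity_score(actual, ideal):
    """Sum over attributes of (ideal level - level currently provided)."""
    if actual.keys() != ideal.keys():
        raise ValueError("actual and ideal must rate the same attributes")
    return sum(ideal[name] - actual[name] for name in actual)


# Hypothetical automobile ratings on an assumed 1-to-5 scale.
actual = {"reliability": 4, "fuel economy": 3, "comfort": 5}
ideal = {"reliability": 5, "fuel economy": 5, "comfort": 5}

# A larger disparity is presumed to indicate lower satisfaction.
porter = disparity_score(actual, ideal)
```

Because the score depends entirely on which attributes are rated, omitting a relevant attribute changes the total for every respondent who perceives a gap on it.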

Sources of Data

Self-administered questionnaires were completed by upper-level undergraduate students at the University of Arizona (N1 = 68) and Washington University (N2 = 107). Cooperation was voluntary and anonymous in both cases. The Arizona sample was surveyed in-class and all subjects agreed to participate. The Washington University survey was to be completed out-of-class; 82% of the students returned their questionnaires. Completion of the instrument took approximately 30 minutes. Respondents were questioned about their experiences with two products currently owned, automobiles and hand-held calculators. Data were analyzed separately by sample.


Descriptive statistics for each measure as applied to each product are shown in Table 1. Ideally the distribution of a satisfaction measure should exhibit a high level of dispersion of responses, thus avoiding "clumping" of respondents within a narrow range of scores. At the same time, a reasonably symmetrical distribution shape is also desired. Examination of the standard deviation of each measure reveals that the LIKERT Scale achieves the greatest dispersion of individual scores for both products, in both samples. The skewness statistics indicate that the VERBAL and LIKERT measures achieve the most symmetrical distributions for automobiles, while for calculators it is the GRAPHIC and LIKERT measures.
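The two distributional criteria can be illustrated with a short Python sketch. The scores below are invented for illustration, and the moment-based skewness coefficient is an assumed choice of statistic, not necessarily the one reported in Table 1.

```python
from statistics import mean, stdev


def skewness(scores):
    """Moment coefficient of skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    center = mean(scores)
    n = len(scores)
    m2 = sum((x - center) ** 2 for x in scores) / n
    m3 = sum((x - center) ** 3 for x in scores) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all scores are equal")
    return m3 / m2 ** 1.5


# Hypothetical scale scores: one symmetric sample and one "bunched"
# toward the satisfied end of the continuum.
symmetric = [1, 2, 3, 4, 5]
bunched = [5, 5, 5, 5, 4, 4, 3, 1]

spread_symmetric = stdev(symmetric)   # sample standard deviation (dispersion)
skew_symmetric = skewness(symmetric)  # zero for a symmetric sample
skew_bunched = skewness(bunched)      # negative: long tail toward low scores
```

A measure that bunches respondents at the satisfied end would show a negative coefficient of this kind together with a comparatively small standard deviation.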



The reliabilities of the various measures as estimated by their internal consistencies are shown in the diagonals of the multitrait-multimethod matrices of Tables 2a and 2b. For the various automobile satisfaction measures, the LIKERT and S-D Scales both attain very high alpha coefficients in both samples, exceeding .93. The VERBAL Scale reliability is also satisfactory in both samples (alpha = .72). In contrast, the PORTER Scale had the lowest internal consistency of all measures in both samples (alpha = .68 and .46). For calculators, the highest internal consistency is attained by the GRAPHIC, LIKERT and S-D Scales, while the others yield somewhat lower though roughly equivalent reliabilities.
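For readers unfamiliar with the statistic, coefficient alpha can be computed as in the Python sketch below, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The response data are invented for illustration and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for rows of item scores, one row per respondent."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    columns = list(zip(*rows))  # regroup the scores item by item
    item_variance_sum = sum(variance(column) for column in columns)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses of five respondents to a three-item scale.
responses = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 4],
]
alpha = cronbach_alpha(responses)
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why a composite of heterogeneous attribute disparities can yield a lower coefficient than a set of items that all address overall satisfaction.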

Campbell and Fiske's (1959) criteria of convergence and discriminability represent necessary though not sufficient conditions for measure validation. Convergence is demonstrated by high correlations between alternative measures within a given trait. Tables 2a and 2b indicate that there is a high level of convergence among satisfaction measures for automobiles with the exception of the PORTER Scale, which in both studies failed to correlate highly with all of the other measures. With regard to calculators, all measure intercorrelations are significant, although there is some variability in the strength of the relationships. In contrast to the automobile ratings, the PORTER Scale does indicate convergence with the other satisfaction measures when applied to calculators.

Discriminability also requires that different measures of different traits not correlate. In Sample I (Table 2a), none of the heterotrait-monomethod correlations and only two of the 20 heterotrait-heteromethod correlations reach significance. The significant trait-method pairs are the VERBAL--PORTER and the LIKERT--PORTER.

In Sample II (Table 2b), none of the heterotrait-monomethod or heterotrait-heteromethod correlations reach significance. In addition to the lack of convergence displayed by the PORTER Scale for automobile satisfaction, its lack of convincing discriminability opens its validity to question.
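The structure of the multitrait-multimethod matrix used here, with two traits (products) and five methods (scales), can be enumerated with the Python sketch below. It only classifies the off-diagonal entries by type; it contains no correlations from the study.

```python
from itertools import combinations

TRAITS = ["automobile", "calculator"]
METHODS = ["VERBAL", "GRAPHIC", "LIKERT", "S-D", "PORTER"]


def classify(measure_a, measure_b):
    """Label the correlation between two distinct (trait, method) measures."""
    (trait_a, method_a), (trait_b, method_b) = measure_a, measure_b
    if trait_a == trait_b:
        return "monotrait-heteromethod"    # convergence entries
    if method_a == method_b:
        return "heterotrait-monomethod"    # same scale, different product
    return "heterotrait-heteromethod"      # different scale, different product


measures = [(trait, method) for trait in TRAITS for method in METHODS]
counts = {}
for a, b in combinations(measures, 2):
    label = classify(a, b)
    counts[label] = counts.get(label, 0) + 1
```

The enumeration gives 20 monotrait-heteromethod, 5 heterotrait-monomethod, and 20 heterotrait-heteromethod entries, which agrees with the 20 heterotrait-heteromethod correlations referred to above.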


The results of this pilot study suggest that selected satisfaction measures derived from parallel disciplines may have merit as indicators of the level of consumer satisfaction, thereby warranting further attention by researchers in this area. Perhaps most importantly, however, the results also provide much needed evidence as to the validity of satisfaction measures for products and services. Overall, the various multi-item rating scale measures examined in this study appeared to perform reasonably well. All but one of the measures (PORTER) clearly met Campbell and Fiske's (1959) criteria of convergence and discriminability and, in addition, attained high levels of internal consistency reliability. The foregoing, however, may be viewed as necessary though not sufficient conditions for inferring measure validity. Further analysis of the multitrait-multimethod matrix based on path-analytic conceptualizations and confirmatory factor analysis would be helpful in providing specific estimates of the proportions of valid (i.e., trait) variance and correlated error (i.e., method) variance.

While it is not appropriate to attempt to identify a single "best" measure given the limitations of the samples in this pilot study, the LIKERT, S-D, and VERBAL measures appear promising for automobile satisfaction measurement. Their internal consistency is high, they converge with other measures, and they succeed in discriminating between unrelated constructs; at the same time, they also yield fairly symmetrical, dispersed distributions of individual responses. The shorter length of the VERBAL Scale (3 items) makes it attractive from the standpoint of administrative efficiency. For calculator satisfaction, however, the VERBAL Scale does not appear as promising as the LIKERT and S-D measures. Whether this finding indicates that different classes of products or services may require different methods of measurement or whether it simply reflects the vagaries of a small pilot test must await further data collection and analysis.

Perhaps the most problematic of the measures studied was the PORTER Scale, an inferential satisfaction measure based on a summation of disparities between product outcomes and those ideally desired. Its lack of convergence for automobiles is troubling, and suggests that not all of the relevant product outcomes may have been identified for inclusion in the scale. Of course, this potential limitation is not unique to the PORTER Scale, but rather applies to all attribute-based composite satisfaction measures.





This research has focused on more or less explicit rating scale methods for satisfaction measurement. Overall, these measures have been observed to work reasonably well in two separate samples. Future research comparing these methods to less structured methods of measurement, notably those based upon open-ended questions, would be especially helpful. Such data realistically will require collection by personal or telephone interview rather than self-administration. However, as Locke (1974) has argued in the context of job satisfaction research, they may provide considerably deeper insight into the meaning of consumers' evaluations and sentiments. Andreasen (1977) has indicated that consistently lower estimates of satisfaction are obtained from free-response data. Which measurement technique is the more accurate indicator of consumer satisfaction, however, remains a question for further research.


