# Beyond Conjoint Measurement: a Method of Pairwise Trade-Off Analysis

Citation:

Richard M. Johnson (1976) ,"Beyond Conjoint Measurement: a Method of Pairwise Trade-Off Analysis", in NA - Advances in Consumer Research Volume 03, eds. Beverlee B. Anderson, Cincinnati, OH : Association for Consumer Research, Pages: 353-358.

This paper discusses data gathering and estimation procedures which overcome some of the problems encountered in conjoint measurement applications, and which lie on the boundary between conjoint measurement and more traditional techniques.

BACKGROUND

Conjoint measurement has had a substantial impact in the relatively brief period since its introduction to the consumer research community (Green and Rao, 1971). One reason for this appears to be its capability of producing relatively "sophisticated" results, typically scaled at the interval level, from rather "primitive" data, normally consisting merely of rank order or paired comparison preference data. This strength allows the researcher to make reasonable predictions of choice behavior, even at the level of the individual consumer.

As we shall see, it is not a simple matter to decide in some borderline cases whether a particular procedure belongs to the class of conjoint measurement procedures or not. However, conjoint measurement procedures, as used in the study of consumer preferences, appear to have certain elements in common.

The first common element is that objects are conceptualized as consisting of "bundles of attributes" or, more precisely, bundles of attribute __levels__. As a method of collecting data, we present several such bundles of attribute levels to respondents and gather information concerning overall preference among them.

The second common element is that of a "composition rule" according to which consumer preference is assumed to be affected by a number of unobservable variables. Frequently we assume that a consumer has a personal utility value associated with each level of each attribute, and that his degree of liking for a particular product is composed in some way from the utilities of its individual attribute levels. The composition rule most frequently assumed is the simple additive one, in which the overall utility of a product is assumed to be the sum of the utility values of its attribute levels.
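As an illustration (not part of the original paper), the additive composition rule is easy to sketch in Python; the attribute names and utility values below are hypothetical stand-ins for Table 1:

```python
# Additive composition rule: a product's overall utility is the sum of the
# part-worth utilities of its attribute levels. All values are hypothetical.
part_worths = {
    "Type": {"AM only": 0, "AM/FM": 70},
    "Clock": {"No clock": 0, "Clock": 30},
}

def overall_utility(concept, utilities=part_worths):
    """Sum the part-worth utility of each attribute level in the concept."""
    return sum(utilities[attr][level] for attr, level in concept.items())

radio_a = {"Type": "AM/FM", "Clock": "No clock"}
radio_b = {"Type": "AM only", "Clock": "Clock"}
print(overall_utility(radio_a))  # 70
print(overall_utility(radio_b))  # 30
```

Under this rule any concept's utility is just a table lookup and a sum, which is what makes the estimation problem tractable.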

The third common element is the estimating procedure by which an individual's utilities are inferred from preference data. Usually in consumer research the respondent is presented with a number of hypothetical product concepts and asked to rank them for preference. We ordinarily assume that his observable rank order of preference is monotonically related to the sum of his unobservable overall utilities for the attribute levels represented in each hypothetical concept. The estimation problem is then simply that of finding estimates of individual attribute utilities such that when subsets of these are summed, these sums have the indicated rank order. Thus, specifically, the third element common to conjoint measurement applications has been estimation by nonmetric or "merely-order-preserving" algorithms, such as Monanova (Kruskal, 1965), monotone regression (Johnson, 1975), or linear programming (Srinivasan and Shocker, 1973).

We shall review data collection methods which have been used in consumer research applications, and propose a new method which overcomes certain practical problems with existing procedures. This new method leads naturally to other estimation techniques which will subsequently be described.

METHODS OF DATA COLLECTION

Within the field of consumer research, two rather dissimilar methods of collecting data have been used extensively, although several studies have used variations and mixtures of these. In order to distinguish among these as well as to describe the new method to be proposed, we shall consider an example.

Suppose we were studying consumer preference for table model radios, and we were to agree that such products could be characterized by six attributes, having the levels shown in Table 1.

The numerical values in Table 1 are utilities for a particular respondent. These are not usually known, of course, since our purpose is usually to estimate them from individual preference data. With an additive composition rule we are free to add or subtract a constant from the values for the various levels of any attribute. For simplicity these utilities have been arbitrarily scaled so that the least liked level for each attribute has a value of zero, and so that their maximum is 100.

One method of gathering preference data involves presenting the respondent with a number of imaginary product concepts, where each concept has a specified level for __every__ attribute. Data collection methods of this type have been advocated by Green and Rao (1971), Green and Wind (1973), and Green and Devita (1975). The distinguishing aspect of this method is that it presents the respondent with entire concepts, specified with respect to __all__ attributes. Although there seems to be no name consistently associated with this approach in the literature, we shall call it a "concept evaluation" method. For example, four imaginary radios that might be presented are shown in Table 2, together with a calculation of overall utility for our respondent.

The estimation problem is usually that of inferring a set of individual attribute utilities from observed preferences. In this example we have turned the process around to show how overall utility for each radio would be computed, given the individual attribute utilities in Table 1. Since Radio A has the highest utility it should be preferred to the others, which should in turn be ranked in order of their overall utilities.

TABLE 2: OVERALL UTILITY CALCULATIONS FOR 4 HYPOTHETICAL RADIOS

It is not clear how many concepts must be presented to a respondent in order to permit robust estimation of his utilities, but it would seem prudent that the number of concepts presented be on the order of three or four times the number of parameters being estimated. If there were six attributes with a total of 19 levels, as in the example, then the number of parameters being estimated is 19 - 6 - 1 = 12 (since we can arbitrarily set one value at zero for each attribute, and can scale all values to have an arbitrary maximum). Therefore we might wish to expose the respondent to 40 or 50 concepts, each of which has a specified level on each of the six attributes. If we were dealing with 10 attributes, each with 4 levels, we might wish to present the respondent with ninety or more concepts, each specified with respect to ten attributes.
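The parameter count used in this arithmetic can be sketched in a few lines of Python; the split of the 19 levels across the six attributes is assumed here for illustration:

```python
def n_parameters(levels_per_attribute):
    """Number of free utility parameters: total levels, minus one arbitrary
    zero point per attribute, minus one overall scale unit."""
    return sum(levels_per_attribute) - len(levels_per_attribute) - 1

# Six attributes totaling 19 levels, as in the example (split assumed)
print(n_parameters([3, 3, 3, 4, 3, 3]))  # 12

# Ten attributes with 4 levels each: 40 - 10 - 1 = 29 parameters,
# so three to four times that is roughly ninety or more concepts
print(n_parameters([4] * 10))  # 29
```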

A second way of gathering data, called "Trade-Off Analysis" by Johnson (1972), has also been described by Fiedler (1972), Davidson (1973), Ross (1975), and Johnson (1975). The same technique appears to have been developed independently by Westwood, Lunn, and Beazley (1974).

This approach presents the respondent with attributes two at a time and asks for a rank order of preference for all the imaginary product concepts that could be described as combinations of their levels. The task is frequently made more systematic by presenting the attributes as rows and columns of a "Trade-Off Matrix," as shown in Table 3. The respondent would see Table 3 with no numbers, and fill in his rank orders in the cells. The cell entries are rank orders of preference that would be expected based on this respondent's individual attribute utilities for Type and Clock as shown in Table 1 and repeated in Table 3. The cell values in parentheses are sums of row and column utilities, and the preference ranks are consistent with these. The estimation problem ordinarily consists of finding a set of individual attribute utilities so that their pairwise sums, such as indicated in the cells of Table 3, have the desired rank orders.
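The expected cell ranks in such a matrix follow directly from the additive rule: each cell is ranked by the sum of its row and column part-worths. A sketch, with hypothetical utilities standing in for the Type and Clock values of Table 1:

```python
import itertools

def tradeoff_ranks(row_utils, col_utils):
    """Rank every (row, column) cell of a trade-off matrix by the sum of its
    row and column utilities; rank 1 is the most preferred cell."""
    cells = list(itertools.product(range(len(row_utils)), range(len(col_utils))))
    order = sorted(cells, key=lambda rc: -(row_utils[rc[0]] + col_utils[rc[1]]))
    ranks = {cell: i + 1 for i, cell in enumerate(order)}
    return [[ranks[(r, c)] for c in range(len(col_utils))]
            for r in range(len(row_utils))]

# Hypothetical utilities: rows are Type levels, columns are Clock levels
print(tradeoff_ranks([70, 0], [30, 0]))  # [[1, 2], [3, 4]]
```

Estimation then runs the other way: find part-worths whose pairwise sums reproduce the ranks the respondent actually filled in.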

In a study involving six attributes, as in the example, the respondent could be shown all fifteen of the possible matrices. With 10 attributes it would be more difficult to expose the respondent to all possible pairs, but one might pair each attribute with a subset of others so that about 20 matrices were filled out. If all attributes had three levels this would involve the presentation of 9 x 20 = 180 stimuli to each respondent, about 9.5 times the number of parameters to be estimated. A procedure for choosing the specific pairs of attributes to be presented is suggested by Johnson and Van Dyk (1975).

LIMITATIONS AND SHORTCOMINGS

A virtue of the concept evaluation method is that the respondent's task is relatively "realistic," since the stimuli to which he is reacting are complete concepts and he is not required to maintain an "all other things being equal" frame of mind. However, this realism is bought at the price of a considerable burden in interview length and task complexity when the number of attributes is greater than six or eight. Indeed, the magnitude of the respondent's task increases quadratically with the number of attributes, since larger numbers of attributes require not only that more concepts be presented, but also that each be more elaborately specified. By contrast, the trade-off matrix method has a clear practical advantage in studies with many attributes, simply requiring that the respondent fill out more matrices, a task that increases more nearly linearly as the number of attributes increases. Using the matrix format, studies involving a dozen attributes are feasible if the number of levels is around forty.

However, the matrix approach has a disadvantage which the concept evaluation approach avoids: it imposes a high degree of potentially artificial structure on the task. Since the stimuli are arranged neatly in rows and columns, it is possible for the respondent to adopt a superficial "patterned" mode of responding by simply rank ordering the matrix cells from left to right or top to bottom. This behavior is observed particularly with respondents who profess a low degree of interest in the task or who find themselves overcome by its rather abstract nature.

A different but equally damaging type of lexicographic behavior is also observed with the concept evaluation approach. Here respondents who are disinterested or overloaded by the large amount of information provided for each concept sometimes narrow their focus, forming their preferences on the basis of presence or absence of one or two prominent levels.

It is clear that both tasks are comparatively difficult for respondents unused to thinking abstractly or processing large amounts of verbal information. We therefore have considerable motivation for finding data collection methods which provide the input for conjoint analysis at the level of the individual respondent while still being simple enough to be appropriate for the wide variety of respondents encountered in consumer research studies today.

PAIRWISE TRADE-OFF ANALYSIS

When using conjoint measurement with an additive composition rule, the role of the data may be regarded as that of providing inequalities which the calculated utilities should satisfy. For example, if a respondent prefers an AM/FM radio with no clock to an AM only radio with a clock, we know that the sum of his utility for AM/FM and his utility for no clock should be greater than the sum of his utility for AM only and his utility for a clock. Similarly, the statement of a preference for one concept over another, when they are both specified with respect to six attributes, merely provides one inequality, although it involves sums of six values each.

It is clear that the minimal task that a respondent can perform while providing inequality information about his underlying utilities is to state which of two cells in a trade-off matrix he prefers. Of course, it is not necessary that he actually be shown a matrix. Individual pairwise questions may be asked in a format like the following.

There is no reason why each stimulus element need consist of only two attribute levels, since this format could as well be used to elicit preference information for concepts specified on many attributes. To do so, however, would require more reading and thought on the part of the respondent and would still result in only one inequality per question.

The conceptual simplicity of the pairwise task renders it capable of being used productively with respondents from a wide range of socioeconomic levels, and our empirical research to date has shown that it produces data which are more consistent internally and of higher quality in general than either the concept evaluation or the matrix format approaches. However, this simplicity is obtained at the price of decreased efficiency in data collection. Pairwise trade-off interviews tend to be longer and somewhat more tedious than those using the matrix format approach.

A rank ordering of N stimuli by the concept evaluation method produces N (N-1)/2 inequalities. Although only N-1 of these are independent, and thus responses to N-1 pairs could theoretically produce the same information under ideal circumstances, this will scarcely ever happen in practice. Such an ideal situation could occur only when a respondent's preferences are known before the pairs to be presented are selected. This might be approached in interactive data collection environments, where pairs presented late in the interview could be selected on the basis of preferences revealed earlier, but it is unlikely to occur when questionnaires must be printed in advance of the interview.

In comparison to the matrix format, it is conceivable that two pairwise questions could provide as much information as a ranking of all cells in a 3 x 3 matrix. However, this will only occur if one attribute dominates another and we are fortunate enough to present the appropriate pairs. In general more information results from having a respondent rank matrix cells than can be obtained from the pairwise approach with reasonable interview length.

This means that the pairwise approach is not really feasible in studies with large numbers of attributes, where matrices would probably be preferred if the respondents are able to cope with the more demanding task.

SPECIAL CONSIDERATIONS FOR DATA COLLECTION

The pairwise trade-off approach removes a number of restrictions which have previously applied to applications of conjoint measurement. The technique can be self-administered successfully, which makes it feasible for mail panel research, airline in-flight surveys, group administration, etc. If the researcher is willing to forego analysis at the individual level and can divide the stimuli among equivalent subsamples of respondents, the technique becomes feasible for telephone interviewing as well. In the limit, each respondent could be presented with one pair.

The questions of how many pairs to present and how to select these are complex, and the answers are not fully known. There are a few guidelines, however, which help in designing a questionnaire.

First, the number of pairs presented should probably at the very least be about three times the number of parameters to be estimated (number of levels minus number of attributes minus one). This is only the crudest of guidelines, however, since "easy" pairs contribute less information than harder pairs. The value of hard, as opposed to easy, pairs must, however, be considered in the light of respondent fatigue and attitude. It may be true that a sprinkling of very easy pairs will help keep the respondent motivated.

Second, it is essential that the design have a high degree of "connectedness" (Johnson and Van Dyk, 1975). While it is not necessary that every possible combination of attribute levels appear, it is certainly essential that every level appear in comparison with enough others to permit indirect comparisons of all differences not actually presented to the respondent.

The most interesting method of collecting data involves a computer-interactive environment where the computer administers the pairs, always presenting one which is nearly optimal in the sense of providing needed information given the pattern of responses obtained thus far. My colleague Frank Goode will discuss such an approach shortly.

COMPUTATIONAL CONSIDERATIONS

A surprising aspect of pairwise trade-off analysis is that, depending on how one views it, the estimation of utilities may be considered as an example of conjoint measurement, classical statistical inference, linear optimization, or stochastic modeling. Let us consider each viewpoint briefly.

In each case we shall conceptualize the computational problem as follows. Suppose that a respondent is given N stimulus pairs. Let there be n attributes, possessing a total of m levels. Consider a "design matrix" X of order 2N by m, with elements of zero or one. The rows of X are considered in pairs, and labeled 1l, 1r, 2l, 2r, etc. The first row of each pair is associated with the left-hand element of that stimulus pair, and the second row of each pair is associated with the right-hand element of that stimulus pair. A given row of X is entirely zero except for ones in the positions corresponding to attribute levels possessed by that element of a stimulus pair. Consider also a vector Y of length 2N, with rows corresponding to those of X, which contains values of +1 for the preferred element of each pair and -1 for the nonpreferred element.
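A minimal sketch of this coding (the level names and the single pair shown are hypothetical):

```python
def design_matrix(pairs, level_index, m):
    """Build the 2N-by-m zero/one design matrix X and the +1/-1 vector Y.
    Each entry of `pairs` is (left_levels, right_levels, left_preferred);
    the level lists name the attribute levels of each stimulus element."""
    X, Y = [], []
    for left, right, left_preferred in pairs:
        for element, sign in ((left, 1 if left_preferred else -1),
                              (right, -1 if left_preferred else 1)):
            row = [0] * m
            for level in element:
                row[level_index[level]] = 1  # one per attribute level possessed
            X.append(row)
            Y.append(sign)
    return X, Y

# Hypothetical coding: two attributes, four levels in all
idx = {"AM/FM": 0, "AM only": 1, "Clock": 2, "No clock": 3}
X, Y = design_matrix([(["AM/FM", "No clock"], ["AM only", "Clock"], True)], idx, 4)
print(X)  # [[1, 0, 0, 1], [0, 1, 1, 0]]
print(Y)  # [1, -1]
```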

1) Estimation by Conjoint Measurement

One way to conceptualize the estimation task is as that of finding a weight for each column of X so that the weighted row sums, which may be regarded as a "prediction" of the vector Y, are as close as possible to Y in a particular sense. Consider the weights as elements of a vector W of length m, and let Ŷ = XW. We simply wish each difference (ŷ_{ir} - ŷ_{il}) to have the same sign as (y_{ir} - y_{il}). This problem is none other than the familiar monotone regression problem, where order comparisons are restricted to subsets of size two. The computation may be handled by any monotone regression algorithm (Kruskal, 1965, or Johnson, 1975).

2) Classical Statistical Inference

We may regard the columns of X as "independent variables" and the vector Y as a "dependent" variable, and use ordinary least squares procedures to find a weighting vector W. It is perhaps aesthetically preferable to regard this as a two-group discriminant analysis, to which it is algebraically equivalent. In either case a set of column weights is sought so as to maximize the difference between mean overall utilities for preferred and nonpreferred stimulus elements, subject to a constraint on the magnitudes of the weights themselves.

Alternatively, we might subtract adjacent rows of X from one another to get N rows corresponding to __differences__ between preferred and nonpreferred stimulus elements. This would produce a matrix Z of differences, of order N by m. Then we might seek a set of weights for the columns of Z which would produce row sums as "positive" as possible, in some sense. One criterion which might be utilized is that the sum of squares of the weighted row sums be maximized, subject to a constraint on the sum of squares of the weights. This is none other than the well-known "eigen problem," and the desired weights are simply the first eigenvector of Z'Z. (This criterion will be strictly appropriate only when the weighted row sums are all positive, but may be close enough most of the time.)
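A sketch of the eigenvector computation, using NumPy and a small hypothetical difference matrix Z (three pairs over four attribute-level columns):

```python
import numpy as np

# Hypothetical difference matrix Z: each row is the (preferred - nonpreferred)
# level indicators for one pair.
Z = np.array([
    [ 1, -1,  0,  0],
    [ 1,  0,  0, -1],
    [ 0,  0,  1, -1],
], dtype=float)

# Weights maximizing the sum of squared weighted row sums, subject to a unit
# sum-of-squares constraint on the weights, are the first eigenvector of Z'Z.
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)  # eigenvalues in ascending order
w = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
if (Z @ w).sum() < 0:                       # eigenvector sign is arbitrary;
    w = -w                                  # orient it toward positive row sums
print(np.round(Z @ w, 3))                   # weighted row sums
```

With consistent data such as this, the oriented first eigenvector yields all-positive row sums, as the criterion intends.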

3) Linear Optimization

Methods of linear programming can be used in the obvious way. Each row of Z represents a constraint in a linear programming tableau with m - n - 1 unknowns. We can add N "slack" variables to the system to ensure nonnegativity of each row sum and seek to minimize the sum of these (Srinivasan and Shocker, 1973).
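This formulation can be sketched with SciPy's general-purpose linear programming routine; the data, and the choice of a unit separation constant to fix the scale, are assumptions for illustration, not part of the original paper:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical difference matrix: rows are (preferred - nonpreferred) indicators
Z = np.array([
    [ 1, -1,  0,  0],
    [ 1,  0,  0, -1],
    [ 0,  0,  1, -1],
], dtype=float)
N, m = Z.shape
eps = 1.0  # arbitrary separation constant; fixes the scale of the solution

# Variables: m utility weights (free) followed by N slack variables (>= 0).
# Minimize total slack needed to bring every row sum up to at least eps.
c = np.concatenate([np.zeros(m), np.ones(N)])
A_ub = np.hstack([-Z, -np.eye(N)])   # -Zw - s <= -eps  <=>  Zw + s >= eps
b_ub = -eps * np.ones(N)
bounds = [(None, None)] * m + [(0, None)] * N

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w = res.x[:m]
print(np.round(Z @ w, 3))  # row sums; all >= eps when the data are consistent
```

When the preference data are consistent the optimal slacks are all zero, and every weighted row sum meets the separation constant.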

4) Stochastic Modeling

One way in which the pairwise trade-off method is unlike other conjoint measurement data collection methods is that it consists of a number of binary choice events which can, in a sense, be regarded as independent of one another. This suggests that it might be fruitful to devise some measure of fit of the model to the data which is more "probabilistic" than those which have arisen in the nonmetric scaling literature. One obvious measure may be obtained by borrowing from the tradition of maximum likelihood estimation, with use of the logistic transformation.

As before, let Y = XW, so that the elements of Y are "overall utilities" for the elements of the stimulus pairs. In particular, y_{il} is the respondent's overall utility for the left-hand element of the ith pair and y_{ir} is his overall utility for the right-hand element. We are interested in finding weights so that differences between adjacent elements of Y have the proper signs.

Let us define the vector U of length 2N with elements

u_i = exp(y_i),    i = 1, ..., 2N.

The elements of U are obtained by an exponential or antilogarithmic transformation of the corresponding overall utilities in Y. The values in U are all positive, and may be regarded as "multiplicative" analogs of the "additive" values in Y.

Now, let us assume that the probability with which a respondent will choose the left-hand element of the ith pair is simply

p_{il} = u_{il} / (u_{il} + u_{ir}).

It is easy to show that

p_{il} = 1 / (1 + exp(-(y_{il} - y_{ir}))),

commonly known as the logistic transformation.

Then a reasonable criterion of fit might involve a likelihood-type function of these estimated probability values. Specifically, let

p_i = p_{il} if the respondent chose the left-hand element of the ith pair, and p_i = 1 - p_{il} otherwise.

Then p_i is the likelihood of the respondent's choosing as he did on the ith pair, given the probabilistic assumptions just stated.

The criterion we have adopted is the root likelihood, defined as

RL = (p_1 p_2 ... p_N)^{1/N}.

This is the geometric mean of all N of the individual pairwise likelihood values.
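The logistic probabilities and the root likelihood criterion can be computed directly; the utilities and pairs below are hypothetical:

```python
import math

def root_likelihood(pairs, w):
    """Geometric mean of logistic choice probabilities over N pairs.
    `pairs` holds (chosen_levels, rejected_levels) as lists of level indices,
    and `w` holds the estimated utility of each level."""
    log_sum = 0.0
    for chosen, rejected in pairs:
        y_chosen = sum(w[i] for i in chosen)
        y_rejected = sum(w[i] for i in rejected)
        # logistic transformation of the utility difference
        p = 1.0 / (1.0 + math.exp(-(y_chosen - y_rejected)))
        log_sum += math.log(p)
    return math.exp(log_sum / len(pairs))

# Two pairs; the respondent chose the higher-utility element both times.
w = [1.0, 0.0, 0.5, 0.0]
print(round(root_likelihood([([0], [1]), ([2], [3])], w), 3))  # 0.675
```

Maximizing this criterion over w (for instance by a gradient method, as the paper suggests) yields the utility estimates.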

The title of this paper contains the phrase "beyond conjoint measurement," which is motivated by the existence of this estimating method. As pointed out above, a common element of conjoint measurement procedures has been estimation by "merely-order-preserving" or nonmetric algorithms. These algorithms produce solutions with as few "large" order violations as possible, and are satisfied by any solution which has no order violations. The likelihood criterion is quite different. It is not basically concerned with order violations but rather with accounting for data in a likelihood sense. There is no discontinuity in its "satisfaction" with a pair when the probability with which it "predicts" that response passes from below .5 to above .5.

This criterion has several desirable properties not shared by other indices of fit commonly used in conjoint measurement studies.

1) Like the others, it can be optimized by gradient-type computing methods. Unlike some others, however, its surface appears to be smooth and not to possess local optima. This removes one of the gravest problems in large-scale applications of nonmetric estimation methods.

2) In additive conjoint measurement the solution is determined only up to an arbitrary multiplicative constant. This source of arbitrariness is removed with the likelihood criterion since the scaling of a solution depends on its degree of fit to the data. If no well-fitting solution exists the estimates of utility for various attribute levels will be quite uniform. On the other hand, if there is very little error of fit the utilities will vary greatly from one another. Thus, an individual who gives coherent, reproducible data will have estimated utilities which are scaled quite differently from those of an individual who responds less consistently.

3) Merely-order-preserving methods are satisfied with solutions which reproduce order perfectly or nearly so. Kruskal's stress and Johnson's theta both tend to "stop worrying about" an order relation when violations become small. Accordingly, these methods tend to produce solutions with many ties or near-ties in differences among estimated utilities. The likelihood criterion, on the other hand, "continues to care about" each order relation, even after it is satisfied. Whereas a contribution to the gradient is proportional to the size of an order violation in those other methods, with the likelihood criterion a pair's contribution to the gradient is proportional to 1 - p_{i}. Thus, although not actually a conjoint measurement estimating method, the likelihood criterion seems to provide __better__ fits in terms of the number of violations than the conjoint measurement methods themselves.

4) One computational output is an estimate of the likelihood that the respondent would choose either member of any possible pair. These values have considerable potential usefulness in interactive data collection. Unasked pairs with estimated p_{i} values near either zero or one would probably not be worth asking. On the other hand, unasked pairs with p_{i} values near .5 are those about which information is lacking, and these are the pairs for which additional information would be of greatest value. Similarly, pairs already presented with low p_{i} values may have been answered in error and could perhaps be reasked.
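The selection rule implied here, asking next the unasked pair whose estimated choice probability is nearest .5, can be sketched as follows (the pairs and probabilities are hypothetical):

```python
def next_pair(unasked, probs):
    """Pick the unasked pair whose estimated choice probability is nearest .5,
    i.e. the pair about which the current utility estimates say the least."""
    return min(unasked, key=lambda pair: abs(probs[pair] - 0.5))

# Hypothetical estimated probabilities of choosing the first-named element
probs = {("A", "B"): 0.95, ("A", "C"): 0.52, ("B", "C"): 0.70}
print(next_pair(list(probs), probs))  # ('A', 'C')
```

The same probabilities flag candidate errors: an already-asked pair answered against a probability near one may be worth re-asking.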

SUMMARY

Pairwise trade-off analysis has two desirable properties. First, the task presented to the respondent is the simplest one possible: a sequence of binary choices. Common sense and experience to date both suggest that a far wider range of people should be able to perform it successfully than is true of either the concept evaluation or the matrix format approach.

Second, data of the type gathered by this technique should be analyzable by a wide variety of computational approaches, including classical least squares techniques. Experience to date suggests that the maximum likelihood approach in particular has a number of desirable properties. These are

1) Invulnerability to local optima

2) A natural and non-arbitrary scaling for the solution

3) Tendency to avoid ties among sums of subsets of utilities

4) Value as a guide to stimulus sequencing in interactive data collection.

REFERENCES

J. D. Davidson, "Forecasting Traffic on STOL," __Operational Research Quarterly__, 24(1973), 561-569.

John A. Fiedler, "Condominium Design and Pricing: A Case Study in Consumer Trade-Off Analysis," __Proceedings of the ACR__ (1972), 279-293.

Paul E. Green and Michael T. Devita, "An Interaction Model of Consumer Utility," __Journal of Consumer Research__, 2(September, 1975), 146-153.

Paul E. Green and Yoram Wind, __Multiattribute Decisions in Marketing: A Measurement Approach__ (Hinsdale, Ill.: Dryden Press, 1973).

Paul E. Green and Vithala R. Rao, "Conjoint Measurement for Quantifying Judgmental Data," __Journal of Marketing Research__, 8(August, 1971), 355-363.

Richard M. Johnson, "Trade-Off Analysis: A Method for Quantifying Consumer Values," Unpublished Manuscript, Market Facts, Inc., Chicago (September, 1972).

Richard M. Johnson, "Trade-Off Analysis of Consumer Values," __Journal of Marketing Research__, 11(May, 1974), 121-127.

Richard M. Johnson, "A Simple Method for Pairwise Monotone Regression," __Psychometrika__, 40(June, 1975), 163-168.

Richard M. Johnson, "On the Measurement of Utilities for the Direction of Public Policy," Unpublished Manuscript, Market Facts, Inc., Chicago (August, 1975).

Richard M. Johnson and Gerald J. Van Dyk, "A Resistance Analogy for Efficiency of Paired Comparison Designs," Unpublished Manuscript, Market Facts, Inc., Chicago (April, 1975).

Joseph B. Kruskal, "Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data," __Journal of the Royal Statistical Society__, Series B, 27(1965), 251-63.

Richard B. Ross, "Measuring the Influence of Soft Variables on Travel Behavior," __Traffic Quarterly__, (July, 1975), 333-346.

V. Srinivasan and Allan D. Shocker, "Estimating the Weights for Multiple Attributes in a Composite Criterion Using Pairwise Judgments," __Psychometrika__, 38(1973), 473-493.

Dick Westwood, Tony Lunn, and David Beazley, "The Trade-Off Model and Its Extensions," __Journal of the Market Research Society__, 16(July, 1974), 227-241.
