Hierarchical Model Testing in Conjoint Analysis

J. Douglas Carroll, Bell Laboratories
Paul E. Green, University of Pennsylvania
Wayne S. DeSarbo, University of Pennsylvania
ABSTRACT - Increasingly, researchers are becoming interested in the relationship of part-worth functions, obtained from conjoint analysis, to other aspects of the respondents (e.g., their demographics, preferences for current brands, etc.). This paper describes a straightforward procedure for determining commonalities among utility functions, as related to other facets of the subject and experimental task.
J. Douglas Carroll, Paul E. Green, and Wayne S. DeSarbo (1980), "Hierarchical Model Testing in Conjoint Analysis," in NA - Advances in Consumer Research Volume 07, eds. Jerry C. Olson, Ann Arbor, MI: Association for Consumer Research, Pages: 688-691.

Advances in Consumer Research Volume 7, 1980     Pages 688-691





To date, virtually all applications of conjoint analysis have involved a sufficient number of preference judgments to enable the researcher to estimate utility functions at the individual-respondent level. By so doing, the utility functions can be used later in various types of simulations involving individual choice behavior.

Nevertheless, research situations can arise in which the researcher is interested in what various respondents' utility functions may have in common. For example, are their utilities sufficiently similar to be represented by a common (group-average) function? If not, what correspondences among respondents may exist between the extremes of complete individuality and complete agreement?

In other cases--particularly commercial applications of conjoint analysis where respondent time and survey cost constraints are prevalent--the researcher may have to settle for fewer preference judgments than are needed for individual utility function estimation. Accordingly, one may wish to fit models that assume some type of partial commonality across respondents.

In both kinds of situations it is also typically the case that individual preference judgments are not highly reliable on a test-retest basis. To the extent that estimates based on various levels of aggregation are compatible with the original data, group-based parameter values should be more stable than individual-based estimates.

The purpose of this research note is to describe a statistical procedure that enables the consumer researcher to test alternative utility models in a hierarchical manner. The approach draws upon model comparison techniques in multiple regression by which a researcher can compare some "full" model with some "restricted" model in which the parameters of the latter model are a proper subset of those of the former model. [For a general discussion of model comparison procedures, see Chapter 2 of Green, with Carroll (1978). A related approach to the analysis described here can be found in Ford, Moskowitz, and Wittink (1977).] The test is designed to find out if the additional parameters in the full model account for a significant amount of additional variance in the criterion variable to warrant their inclusion.

The basic formula for carrying out these tests utilizes the F statistic:

$$F = \frac{(R_f^2 - R_r^2)/(d_r - d_f)}{(1 - R_f^2)/d_f} \tag{1}$$

where $R_f^2$ and $R_r^2$ denote the coefficients of multiple determination for the full and restricted models, and $d_f$ and $d_r$ denote their respective (residual) degrees of freedom. Under the usual error term assumptions, this statistic follows the F distribution, with $d_r - d_f$ degrees of freedom for the numerator and $d_f$ degrees of freedom for the denominator.
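The full-versus-restricted comparison described above can be sketched in a few lines of Python. This is a minimal illustration of the standard textbook form of the test, not the authors' own program; the function name `nested_f_statistic` and the numbers in the usage line are hypothetical.

```python
def nested_f_statistic(r2_full, r2_restricted, df_full, df_restricted):
    """F statistic for a restricted model nested in a full model.

    df_full and df_restricted are residual degrees of freedom
    (observations minus fitted parameters), so df_restricted > df_full.
    Returns (F, numerator df, denominator df).
    """
    df_num = df_restricted - df_full
    if df_num <= 0:
        raise ValueError("the full model must fit more parameters than the restricted one")
    if not 0.0 <= r2_restricted <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_restricted <= r2_full < 1")
    f_value = ((r2_full - r2_restricted) / df_num) / ((1.0 - r2_full) / df_full)
    return f_value, df_num, df_full


# Hypothetical numbers, chosen only to exercise the function.
f_value, df_num, df_den = nested_f_statistic(0.40, 0.30, df_full=90, df_restricted=94)
print(round(f_value, 2), df_num, df_den)
```

The returned degrees of freedom are the ones needed to look up the critical value in an F table.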

Model comparison tests are not new. They have been used in such diverse areas as econometric analysis and the analysis of multidimensional contingency tables. However, their use in consumer research is still in its early stages.


Data for application of hierarchical model testing were obtained from another study (Carroll, Green and DeSarbo 1979). A sample of 46 second-year MBA students (33 males and 13 females) were asked to rate 32 profile descriptions of leisure time allocation on a 0-10 point desirability scale (see Table 1). The profiles were made up according to an orthogonal main effects plan entailing four levels each of the five activities in Table 1:

Level 1: 1 or 2 hours
Level 2: 3 or 4 hours
Level 3: 5 or 6 hours
Level 4: 7 or 8 hours

The particular number of hours chosen within level was determined randomly, subject to each of the two possible numbers of hours appearing an equal number of times, within level, over the whole set of 32 profiles.
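The balanced random choice of hours within level can be sketched as follows. This is only an illustration under stated assumptions: the list `levels` is a hypothetical flat list of level codes (it is not the orthogonal plan used in the study), and the function name `assign_hours` is invented here.

```python
import random

# The two possible numbers of hours for each level, as listed above.
HOURS_BY_LEVEL = {1: (1, 2), 2: (3, 4), 3: (5, 6), 4: (7, 8)}


def assign_hours(levels, rng):
    """Map level codes to hours so that, within each level, both hour
    values are used equally often over the whole list of cells."""
    hours = [None] * len(levels)
    for level, (low, high) in HOURS_BY_LEVEL.items():
        cells = [i for i, lv in enumerate(levels) if lv == level]
        if len(cells) % 2:
            raise ValueError(f"level {level} occurs an odd number of times")
        rng.shuffle(cells)              # random order of the cells at this level
        half = len(cells) // 2
        for i in cells[:half]:          # first half gets the lower value
            hours[i] = low
        for i in cells[half:]:          # second half gets the higher value
            hours[i] = high
    return hours


# Hypothetical level codes: each level appears 8 times over 32 cells.
levels = [1, 2, 3, 4] * 8
hours = assign_hours(levels, random.Random(0))
print(hours[:8])
```

Passing the level codes of all activities as one flat list would balance the hours over the whole set of profiles rather than within a single activity.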

Two classes of multiple regression models were fitted. Based on theoretical considerations, it was hypothesized that a linear-in-logs model (in which the desirability ratings are regressed on log-hours of each activity) might be an appropriate representation. In this case six parameters can be fitted--a partial regression coefficient for each of the five activities and an intercept term.

However, a more general model consists of a dummy-variable regression in which each four-level activity is coded by three dummies. In this case 15 partial regression coefficients, in addition to an intercept term, are fitted.
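The two predictor codings can be illustrated with a short sketch. The use of natural logarithms and the choice of level 1 as the dummy-variable baseline are assumptions made here for illustration; the text specifies neither. The function names and the example profile are hypothetical.

```python
import math


def log_hours_row(hours):
    """Linear-in-logs coding: intercept plus log(hours) for each activity."""
    return [1.0] + [math.log(h) for h in hours]


def level_of(h):
    """Level 1..4 for hours 1..8 (1-2 -> 1, 3-4 -> 2, 5-6 -> 3, 7-8 -> 4)."""
    return (h + 1) // 2


def dummy_row(hours):
    """Dummy coding: intercept plus three 0-1 indicators per activity.
    Level 1 is treated as the baseline (an assumption)."""
    row = [1.0]
    for h in hours:
        lv = level_of(h)
        row.extend(1.0 if lv == k else 0.0 for k in (2, 3, 4))
    return row


profile = [2, 5, 8, 3, 1]  # hypothetical hours for the five activities
print(len(log_hours_row(profile)), len(dummy_row(profile)))  # 6 16
```

The row lengths match the parameter counts given above: 6 for the linear-in-logs model and 16 for the dummy-variable model.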

For purposes of illustration we first apply the hierarchy of models approach to the simpler, linear-in-logs model. This is followed by a less comprehensive examination of the dummy-variable regression model and a general discussion of the procedure.


Since each respondent supplies 32 observations (drawn from an orthogonal main effects design), sufficient data are available to estimate all parameters at the individual-respondent level. However, to motivate the approach, a sequence of five models was first fitted and tested:

1.  Model 1--a linear-in-logs model fitted to data pooled over all 46 respondents.

2.  Model 2--a model that included all of the predictors of model 1 plus a single dummy intercept term denoting the respondent's sex.

3.  Model 3--a model-1 extension that included a separate dummy-variable intercept term for each respondent.

4.  Model 4--a model-3 extension that included a slope term for each respondent.

5.  Model 5--a model-4 extension that fitted a separate linear-in-logs model for each respondent.



While models 1, 2, and 5 are well known, a few comments are in order regarding models 3 and 4. Model 3 assumes that each subject has the same utility function as the group but allows the individual to have an idiosyncratic origin or reference point.

Model 4 assumes that each subject has the same utility function as the group but permits an idiosyncratic origin and an idiosyncratic scale unit by which the utilities are stretched or compressed to best fit the subject's data. Model 4 is not, strictly speaking, a linear model; it is a bilinear model. However, a very good approximate fit can be obtained by a sequence of linear (least squares) model fitting steps, yielding F statistics that are approximately distributed as F (with the appropriate degrees of freedom).

Model 2 Versus Model 1

As shown in Table 2, the R2 values for models 1 and 2 were 0.055 and 0.057, respectively. Substituting in equation (1), we have:

$$F = \frac{(0.057 - 0.055)/1}{(1 - 0.057)/1466} \tag{2}$$

which, with 1 degree of freedom for numerator and 1466 degrees of freedom for denominator, is not significant at the 0.05 level. [In this example, the degrees of freedom are: EQUATION where n denotes the number of cases (46 x 32 = 1472), P1 denotes the number of predictors for model 1, and P2 denotes the additional dummy-variable for sex.]
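As a rough numerical check of this comparison, the quoted R2 values and degrees of freedom can be plugged in directly. The sketch below uses only the rounded figures given in the text, so the result need not agree exactly with Table 2. The p-value is a large-sample approximation introduced here, not a computation from the paper: with 1 numerator degree of freedom and a large denominator, F is close to a chi-square variate with 1 degree of freedom, whose upper tail is erfc(sqrt(F/2)).

```python
import math

r2_model1, r2_model2 = 0.055, 0.057   # R2 values quoted in the text
df_num, df_den = 1, 1466              # degrees of freedom quoted in the text

f_value = ((r2_model2 - r2_model1) / df_num) / ((1.0 - r2_model2) / df_den)

# Large-sample approximation (an assumption of this sketch): for df_num = 1
# and large df_den, treat F as chi-square with 1 df.
p_approx = math.erfc(math.sqrt(f_value / 2.0))

print(round(f_value, 2), p_approx > 0.05)
```

The approximate p-value exceeds 0.05, in line with the conclusion that adding the sex intercept does not significantly improve the fit.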

By way of substantive interest, the parameter values of model 1 are:

$$
\begin{aligned}
Y = 3.098 &- 0.021\ \text{log-hours (TV)} \\
&+ 0.189\ \text{log-hours (reading)} \\
&+ 0.204\ \text{log-hours (sports)} \\
&- 0.176\ \text{log-hours (hobbies)} \\
&+ 0.777\ \text{log-hours (socializing)}
\end{aligned} \tag{3}
$$

A test of each partial regression coefficient indicated that all coefficients except that for TV were significant beyond the 0.05 level. Of the significant coefficients, we note that all are positive except that (-0.176) associated with hobbies. [Subsequent analysis indicated that the utility function for hobbies was of the ideal-point variety (Carroll 1972), in which preference ratings first increased slightly and then decreased rather sharply.]

Other Model Comparisons

In a similar fashion other model comparisons were made, with results also appearing in Table 2. Model 3 is straightforward (since only 45 additional dummy variables in addition to the intercept term and 5 log-hours predictors are required). As such, one obtains a single value of R2 for the complete regression.

Such is not the case for model 4, which allows both idiosyncratic origin and unit. Model 4 is fitted by first computing the regression function for the total group; see equation (3). Following this, the same 32 fitted criterion values $\hat{Y}_i$ (i = 1, 2, ..., 32), as computed for the group, serve as the independent variable in each subject's two-variable regression. Each subject's original Yi's serve as the criterion variable. Each of the 46 separate regressions yields an R2 value. However, what is needed is a single R2 that reflects variance accounted-for around the grand mean across the total sample (not around each subject's mean). This summary value, denoted by R2, is computed as follows:


where, using conventional dot notation to indicate total-group versus individual criterion-value means, we have for the k-th individual:

EQUATION    (5)  and   (6)

After R2 is obtained, this value appears as the appropriate entry for the full model in equation (1).

A similar procedure was used to obtain a single R2 for model 5. However, in this case each individual R2k is found by regressing the k-th subject's Yi's on his/her own predictor set, followed by application of equation (4) to the individual R2k's.
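The two-step fitting of model 4 and the summary R2 about the grand mean can be sketched as below. This is one computation consistent with the verbal description (residual sum of squares pooled over subjects, divided by the total sum of squares around the grand mean); it is an assumption of this sketch and is not a transcription of equations (4) through (6). The function names and the toy data are hypothetical.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pooled_r2(group_fit, ratings_by_subject):
    """Summary R2 about the grand mean when each subject gets an own
    origin (intercept) and unit (slope) applied to the group fit.

    group_fit: fitted values from the pooled regression, one per profile.
    ratings_by_subject: one list of ratings per subject, same profile order.
    """
    all_ratings = [y for ys in ratings_by_subject for y in ys]
    grand_mean = sum(all_ratings) / len(all_ratings)
    sse = 0.0
    for ys in ratings_by_subject:
        a, b = simple_ols(group_fit, ys)   # subject-specific origin and unit
        sse += sum((y - (a + b * g)) ** 2 for g, y in zip(group_fit, ys))
    sst = sum((y - grand_mean) ** 2 for y in all_ratings)
    return 1.0 - sse / sst


# Toy data (hypothetical): both subjects are exact rescalings of the group fit.
group_fit = [1.0, 2.0, 3.0, 4.0]
ratings = [[2.0, 4.0, 6.0, 8.0], [5.0, 4.5, 4.0, 3.5]]
r2_summary = pooled_r2(group_fit, ratings)
print(round(r2_summary, 6))
```

For model 5 the same pooling would apply, with each subject's residuals taken from a regression on his/her own predictor set instead of on the group fit.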


A less extensive hierarchical comparison was made of models based on a dummy-variable formulation of the problem. This comparison entailed three models--models 6 through 8. As noted earlier, each of the five 4-level activities can be coded into three 0-1 dummy variables, leading to a regression equation that fits 15 partial regression coefficients and an intercept term.

The R2 of model 6 is computed from data pooled over all 46 respondents. The R2 of model 8 is based on individual R2k's, found from 46 individual fits, followed by application of equation (4).

Model 7, however, is based on a different procedure. In this model we assume that each subject follows the utility of the group but with the additional freedom to exhibit different importance weights, associated with the total-group utilities. This model can be expressed as:

$$Y_i^{(k)} = b_0^{(k)} + \sum_{j=1}^{J} b_j^{(k)} U_{ij} \tag{7}$$

where $b_0^{(k)}$ is the k-th subject's intercept, the $b_j^{(k)}$'s are his/her importance weights, and $U_{ij}$ is the group-average utility for the i-th allocation (i = 1, 2, ..., 32) of the j-th activity. The $U_{ij}$'s, in turn, are computed from the preliminary group-level regression of model 6 and represent the appropriate partial regression coefficient associated with the l-th level (l = 1, 2, 3, 4) of the j-th activity.

Following computation of the 46 individual R2's, as based on equation (7), R2 was computed by means of equation (4). Since model 7 fits only six parameters for each subject (rather than the 16 parameters fitted in model 8), it is more restrictive. As Table 2 shows, both of the tests are significant. It should be noted that in cases where the researcher is limited in terms of the number of stimuli that can be presented to the subject, model 7 needs only J + 1 individual observations while model 8 requires

$$1 + \sum_{j=1}^{J} (m_j - 1)$$

observations, where $m_j$ denotes the number of levels of the j-th attribute (or activity, in this case).
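The construction of the model 7 predictors and the two parameter counts can be sketched as follows. The part-worth values in `GROUP_PART_WORTHS` are invented purely for illustration and are not estimates from the study; the function names are likewise hypothetical.

```python
ACTIVITIES = ["TV", "reading", "sports", "hobbies", "socializing"]

# Hypothetical group part-worths (NOT from the paper): one value per level
# 1..4 of each activity, as a pooled dummy-variable regression would give,
# with level 1 as the zero baseline.
GROUP_PART_WORTHS = {
    "TV": [0.0, 0.1, 0.0, -0.3],
    "reading": [0.0, 0.2, 0.4, 0.3],
    "sports": [0.0, 0.3, 0.5, 0.4],
    "hobbies": [0.0, 0.2, -0.1, -0.6],
    "socializing": [0.0, 0.6, 1.1, 1.5],
}


def salience_row(levels):
    """Predictors for one profile under model 7: 1, U_i1, ..., U_iJ."""
    return [1.0] + [GROUP_PART_WORTHS[a][lv - 1] for a, lv in zip(ACTIVITIES, levels)]


def params_salience(n_attributes):
    """Intercept plus one importance weight per attribute."""
    return n_attributes + 1


def params_dummy(levels_per_attribute):
    """Intercept plus (levels - 1) dummy coefficients per attribute."""
    return 1 + sum(m - 1 for m in levels_per_attribute)


row = salience_row([1, 3, 4, 2, 4])   # hypothetical profile, as level codes
print(row)
print(params_salience(5), params_dummy([4] * 5))  # 6 16
```

Each subject's ratings would then be regressed on rows of this kind, one row per profile, to obtain the intercept and the five importance weights.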


As may be surmised, other classes of models can be fitted that may be appropriate in various applications. For example, cases can arise in which subjects are classified into a priori groups, as illustrated earlier in the case of male versus female respondents.

In the model 2 versus model 1 comparison we examined only the difference in male versus female scale origins (via the fitting of a single intercept term). Clearly, other models could be considered, such as:

1.  Differential slopes as well as intercepts.

2.  Differential saliences, as in model 7.

Moreover, models could be developed to encompass several sets of background variables--sex, age, marital status--simultaneously, if desired.

Still another class of hierarchical models that can be fitted and compared are those based on a preliminary cluster analysis of the response data. For example, assuming that each respondent receives a set of common (or "core") stimuli, subjects can be initially clustered on the basis of some convenient program like Johnson's hierarchical method (Johnson 1967). The input data may consist of Euclidean distances computed between each subject pair's response vectors, as associated with the core stimuli.
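The distance input for such a clustering step can be sketched as below. The subject labels and response vectors are hypothetical, and the sketch stops at the distance table; it does not implement Johnson's clustering method itself.

```python
import math
from itertools import combinations


def pairwise_distances(responses):
    """Euclidean distance between every pair of subjects' response vectors
    on the core stimuli, keyed by the (sorted) pair of subject labels."""
    return {
        (a, b): math.dist(responses[a], responses[b])
        for a, b in combinations(sorted(responses), 2)
    }


# Hypothetical ratings of three core stimuli by three subjects.
responses = {"s1": [0, 3, 4], "s2": [0, 0, 0], "s3": [0, 3, 4]}
distances = pairwise_distances(responses)
print(distances[("s1", "s2")])  # 5.0
```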

Following this, group utilities are computed for each cluster and the salience model applied for each subject in the cluster. The resulting average R2 is then compared to the average R2's associated with the subject being assigned to each of the other clusters, in turn. The subject is then assigned to that cluster for which the overall average R2 is highest.

After all subjects are so classified, group utilities are then computed for the new clusters and the assignment process is repeated until either a researcher-supplied maximum number of iterations is met or until no subject is reassigned from iteration t to iteration t+1.
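The reassignment loop just described can be sketched in simplified form. Two simplifications are assumed here and are not the authors' procedure: the cluster's "group utilities" are replaced by the mean rating per profile, and each subject's fit is the R2 of a simple regression on that mean profile rather than the salience model. All names and the toy data are hypothetical.

```python
def r2_on_group(group, ys):
    """R2 of a simple regression of one subject's ratings on a group profile."""
    n = len(ys)
    mg, my = sum(group) / n, sum(ys) / n
    sgg = sum((g - mg) ** 2 for g in group)
    syy = sum((y - my) ** 2 for y in ys)
    sgy = sum((g - mg) * (y - my) for g, y in zip(group, ys))
    if sgg == 0 or syy == 0:
        return 0.0
    return sgy * sgy / (sgg * syy)


def cluster_profile(members):
    """Stand-in for the cluster's group utilities: mean rating per profile."""
    return [sum(col) / len(col) for col in zip(*members)]


def reassign(ratings, labels, max_iter=20):
    """Refit cluster profiles and move each subject to the best-fitting
    cluster until no subject moves or max_iter is reached."""
    labels = list(labels)
    for _ in range(max_iter):
        clusters = sorted(set(labels))
        groups = {
            c: cluster_profile([r for r, lab in zip(ratings, labels) if lab == c])
            for c in clusters
        }
        new_labels = [
            max(clusters, key=lambda c: r2_on_group(groups[c], ys)) for ys in ratings
        ]
        if new_labels == labels:   # no subject reassigned: stop
            break
        labels = new_labels
    return labels


# Toy data (hypothetical): two response patterns, one subject initially misplaced.
ratings = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 4, 1, 4], [2, 5, 2, 5]]
final_labels = reassign(ratings, [0, 0, 0, 1])
print(final_labels)
```

In the toy data the third subject starts in the wrong cluster and is moved after the first pass, after which the assignment is stable.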

Having found the average R2 associated with the C clusters one could apply equation (1) to see if this model should be accepted, versus a model based on data pooled across all subjects (albeit with significance levels to be taken with a large grain of salt, given the way in which the subgroups were formed in the first place).

Still other model comparisons could be made and only a few of the possibilities have been illustrated here. Suffice it to say that hierarchical model testing provides a very flexible approach to the study of individual or intergroup differences in utility functions. Considering the problems encountered in obtaining reliable data at the individual-respondent level and the pressing need to keep the number of stimuli that each subject receives to a manageable number, the hierarchical testing approach should see increasing application in the future. It provides a uniform approach to selecting the most parsimonious model that is consistent with the systematic variation in the data and the researcher's sequence of candidate models for testing.


Carroll, J. Douglas (1972), "Individual Differences and Multidimensional Scaling," in R. N. Shepard, A. K. Romney, and S. B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences. Vol. I. New York: Seminar Press, 105-155.

Carroll, J. Douglas, Paul E. Green, and Wayne S. DeSarbo (1979), "Optimizing the Allocation of a Fixed Resource: A Simple Model and Its Experimental Test," Journal of Marketing, 43, 51-57.

Ford, David L., Herbert Moskowitz, and Dick R. Wittink (1977), "Econometric Modeling of Individual and Social Multiattribute Utility Functions," Multivariate Behavioral Research, 13, 77-98.

Green, Paul E., with contributions by J. Douglas Carroll (1978), Analyzing Multivariate Data. Hinsdale, Ill.: Dryden Press.

Johnson, Steven C. (1967), "Hierarchical Clustering Schemes," Psychometrika, 32, 241-254.