Using Structural Zeros to Simplify the Interpretation of Multiway Contingency Tables With Interactions
ABSTRACT - This paper analyzes the situation in which a simple model does not adequately describe the relationships in a multiway contingency table. In such cases the typical solution is to fit models with more complex interactions until an adequate fit is obtained. In some cases, however, the researcher can use structural zeros to retain the simple model for a portion of the population and either ignore or explicitly model the relationships in the remainder of the table. This approach is illustrated in an examination of the relationship of certain demographic characteristics to telephone publication status.
Citation:
Lawrence F. Feick (1985), "Using Structural Zeros to Simplify the Interpretation of Multiway Contingency Tables With Interactions", in NA - Advances in Consumer Research Volume 12, eds. Elizabeth C. Hirschman and Morris B. Holbrook, Provo, UT: Association for Consumer Research, Pages: 226-230.
INTRODUCTION
Over the last fifteen years, a number of survey researchers have examined differences between the characteristics of households with published versus unpublished telephone numbers. These studies have found that differences exist on a rather large number of demographic variables. For example, households with unpublished telephone numbers are more likely to be urban and from either the east or west coasts of the U.S. and tend to have younger heads and to fall into a low- to lower-middle income bracket. Studies also indicate unpublished telephone numbers tend to be characteristic of individuals who are single, divorced, or separated. Finally, unpublished numbers tend to be more likely in households which are nonwhite, which have young children present, and in which the head has a high school education or less. A review of these results is included in Tyebjee (1979).
In general, the studies which generated these results used an analysis of bivariate relationships--separately examining the effects of demographic characteristics on publication status. However, a more salient question to the practicing consumer researcher would deal with the combined relationship of these characteristics to publication status. That is, what is the probability that an individual with a certain combination of characteristics has a published telephone number?
For questions of this sort the researcher would use multivariate analysis. In the analysis of publication of telephone numbers, the researcher would treat publication status as a dependent variable and probably fit a logit model to the multiway table formed from the demographic variables and publication status. Among other things, such an approach would give the researcher greater insight into the relative importance of the demographic variables as predictors and an idea of the variation in proportion published across demographic segments of the population. Perhaps implicit in the mind of the researcher is a simple main effects model--that is, the more of the characteristics individuals have that make them unlikely to publish telephone numbers, the greater would be the expected proportion of unpublished numbers.
This paper analyzes the situation in which the main effects model does not adequately describe the data. A common procedure in such situations is to elaborate the model using interactions. Such a procedure will always improve model fit, although at the expense of parsimony. After elaboration, model selection could then be based on a comparison of candidate models using both the fit and parsimony criteria. This paper offers an alternative approach which focuses on identifying the source of the lack of fit and retaining the main effects model for the portion of the population in which it appears tenable. This approach can offer a more interpretable model since interactions derived under a model elaboration procedure may be theoretically inexplicable and will generally impede the understanding of main effects.
DATA
Data analyzed were gathered by telephone interview in October, November, and December 1980 from 1,382 adult women who resided in the 48 coterminous states and the District of Columbia. Respondents for the study were limited to women between the ages of 20 and 59 who were not currently pregnant or lactating.
Telephone numbers called for the interviews were selected by random digit dialing. Three callbacks were made to numbers which were busy or rang with no answer. Adjusted response rate to the survey was 52.8 percent. During the interviews, respondents were asked questions about demographic characteristics and whether their number was published in a directory. About 16.5 percent of the sample had voluntarily unpublished numbers and 2.5 percent had involuntarily unpublished numbers--numbers assigned after publication, inadvertently unpublished, etc. Thus, about 19 percent of the sample had unpublished numbers in the analysis. These results are quite consistent with those reported by Rich (1977) for the Bell System and with other previously reported research (Glasser and Metzger 1972, Groves 1978, Groves and Kahn 1979). The analysis reported here is based on 1,371 cases, eliminating 11 cases with missing values.
For pedagogical purposes this paper analyzes the relationship between telephone publication status and age, education, marital status, and race. These variables were selected and included as dichotomies for illustrative purposes. As noted earlier, other variables are also related to publication status and could have been used instead of or in addition to the ones selected.
RESULTS AND DISCUSSION
Table 1 presents bivariate crosstabulations of age, education, marital status, and race with publication status. In Table 1, all four of the variables had strongly significant bivariate relationships to publication status. Individuals who were nonwhite, nonmarried, younger, or who had less education were more likely to have an unpublished telephone number. These results are quite consistent with the results reported in the literature cited earlier. As suggested earlier, however, the researcher would probably be interested in the multivariate relationship of the variables and publication status.
Table 2 presents observed frequencies in the five-way table and the proportion of published numbers by combinations of demographic characteristics. In Table 2, women with all of the characteristics conducive to having a published number--women who were married, older, white, and better educated--had ninety-one percent published numbers. However, only thirty-one percent of the women with all of the opposite characteristics had published telephone numbers. In general in the table, increasing the number of characteristics associated with unpublished numbers tends to increase the observed proportion of unpublished numbers. To formally test the basic main effects model, one would fit a logit model to the data of Table 2.
TABLE 1: CROSSTABULATIONS OF PUBLICATION STATUS BY SELECTED DEMOGRAPHIC CHARACTERISTICS
TABLE 2: OBSERVED FREQUENCIES, PROPORTION PUBLISHED, AND STANDARDIZED RESIDUALS BY DEMOGRAPHIC CHARACTERISTICS
TABLE 3: LOGIT ESTIMATES OF VARIOUS MODELS FOR PUBLICATION STATUS
Table 3 presents the results of this model fitting. In Table 3, abbreviations for the variables are the first letter of the variable they represent (i.e., M = marital status, A = age, E = education, R = race, P = publication status) and the inclusion of an interaction implies the inclusion of all lower order relatives. The logit models listed in Table 3 are presented in the loglinear models style employed by Goodman (1970) but are completely equivalent to the style of presentation used by, for example, Green (1978). Interpretation of the models is straightforward. For example, in Model 2, EP refers to the main effect of education on publication status and the other effects are defined similarly. The EMRA term is fit in all the models under the assumption that the joint marginal of the predictor variables is fixed. This assumption is always justified for product-multinomial sampling designs, for example, experiments in which the number of observations within treatment levels is fixed by design.
It is conventional to also adopt a conditional logit approach when the sampling design is multinomial and the researcher can clearly distinguish response from explanatory variables, even though the number of observations within levels of the explanatory variables is not fixed. Such an approach is advocated by Cox (1970), for example.
Model M1, a logit baseline model, is included for comparison purposes. Model M2 is the main effects logit model fit to the data of Table 2. This model does not adequately describe the data and the researcher might, as noted earlier, be led to consider more elaborate models. DeSarbo and Hildebrand (1980) describe elaboration procedures which could be used for model selection. A subset of the possible models to be considered is included as M3-M8. While models M6-M8 adequately describe the data, the fit comes at the expense of including complex interactions. Further, it is not clear whether, for example, M6 or M7 would be preferred since in both cases the included interactions have no theoretical meaning and are only included to improve fit, i.e., to better reproduce the observed table.
It is not uncommon in practice to find loglinear modeling results in which simpler models do not adequately describe the data. As noted, the common practice in such cases is to elaborate the model to find the most parsimonious model which does fit the data. For a consumer research example of this type of elaboration, see Peterson, Leone, and Sabertehrani (1981).
An alternative approach to seeking model fit through data dependent fitting of interactions is to examine the main effects model for the source of the lack of fit. This can be done using standardized residuals produced under the main effects model. Standardized residuals produced under M2 are included in Table 2.
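The residual screening described above can be sketched in a few lines. The paper does not state the formula behind its standardized residuals, so this sketch assumes the common Pearson form, (observed - expected) / sqrt(expected). The function names and the cutoff of 2.0 are hypothetical choices for illustration, not values taken from the paper.

```python
import math

def standardized_residual(observed, expected):
    """Pearson-type standardized residual: (observed - expected) / sqrt(expected).

    ASSUMPTION: the paper does not give its formula; this is one common form.
    """
    if expected <= 0:
        # A cell with zero fitted frequency (e.g. a structural zero) has no
        # meaningful residual under this definition.
        return float("nan")
    return (observed - expected) / math.sqrt(expected)

def flag_large_residuals(observed, expected, threshold=2.0):
    """Return indices of cells whose absolute residual exceeds the threshold.

    The threshold of 2.0 is a hypothetical rule of thumb, not from the paper.
    """
    flagged = []
    for i, (o, e) in enumerate(zip(observed, expected)):
        r = standardized_residual(o, e)
        if not math.isnan(r) and abs(r) > threshold:
            flagged.append(i)
    return flagged
```

Cells flagged this way are candidates for closer inspection; whether they should be set aside is a substantive judgment rather than something the residuals decide on their own.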
An examination of Table 2 indicates that the primary source of deviation from the main effects model appears to be women who were younger, better educated, white, and not married, as the residuals for these women are comparatively large. One approach to dealing with these individuals is to assume a different model operates for them than operates for the rest of the population. Thus, rather than attempting to fit these individuals with interactions, one could fit the main effects model to all of the population except for these individuals. Such a procedure could be accomplished using structural zeros.
Structural zeros in a cross-tabulation arise from logical impossibilities--men with hysterectomies, for example--rather than from sampling. The researcher would generally want to treat structural zeros in a special way since models chosen should not fit nonzero expected frequencies to such cells. The analysis of cross-classifications with structural zeros is termed incomplete table analysis and models fit to such tables are termed quasi-loglinear models. Summary treatments of incomplete table analysis can be found in Bishop, Fienberg, and Holland (1975), Fienberg (1980), and Haberman (1979).
Loglinear models programs that allow the user to input a design matrix can be used to fit structural zeros; Freq (Haberman 1979) and Multiqual (Bock and Yates 1974) are two of the many programs available. In addition, both BMDP and SPSS-X include loglinear models programs which allow the user to specify structural zeros. All of the models fit in this paper used the BMDP-4F program.
Model M9 treats the cells representing the younger, better educated, white, nonmarried individuals as structural zeros and the resulting improvement in fit of the main effects logit model is dramatic, L² = 7.74 on 10 d.f. A discussion of how degrees-of-freedom were calculated for M9 is included later.
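The logic of M9 can be illustrated with a small, self-contained sketch. It is not the BMDP-4F procedure used in the paper: it is a plain Newton-Raphson logit fit on grouped counts, and the deviant cell is simply left out of the fit, which the paper's concluding comments describe as giving identical results in this example. Table 2 is not available here, so the cell sizes (100 per cell) are hypothetical and the counts are constructed to follow the reported M9 estimates exactly, except in the deviant cell, which is set to the observed proportion of .86 quoted in the text. All function and variable names are illustrative.

```python
import itertools
import math

# Effect-coded (+1/-1) estimates reported for M9 in the text, in the order:
# constant, age (old = +1), education (13+ years = +1),
# marital status (married = +1), race (white = +1).
M9_BETA = [0.569, 0.323, 0.197, 0.695, 0.600]

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_logit(rows, published, total, n_iter=25):
    """Newton-Raphson fit of a main-effects logit to grouped counts."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y, n in zip(rows, published, total):
            p = expit(dot(x, beta))
            w = n * p * (1.0 - p)
            for i in range(k):
                grad[i] += x[i] * (y - n * p)
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# 16 predictor combinations: intercept plus four effect-coded dichotomies.
rows = [(1.0,) + combo for combo in itertools.product((-1.0, 1.0), repeat=4)]
total = [100.0] * len(rows)  # HYPOTHETICAL cell sizes, not the paper's Table 2
published = [n * expit(dot(x, M9_BETA)) for x, n in zip(rows, total)]

# Deviant cell: younger, better educated, not married, white.
deviant = (1.0, -1.0, 1.0, -1.0, 1.0)
idx = rows.index(deviant)
published[idx] = 0.86 * total[idx]  # observed proportion quoted in the text

# Fit to the complete table, then to the table with the deviant cell left out.
beta_all = fit_logit(rows, published, total)
kept = [i for i in range(len(rows)) if i != idx]
beta_m9 = fit_logit([rows[i] for i in kept],
                    [published[i] for i in kept],
                    [total[i] for i in kept])

# Proportion the main-effects model would predict for the excluded cell.
expected_deviant = expit(dot(deviant, beta_m9))
```

Under these constructed counts, the fit that omits the deviant cell recovers the reported main effects, while the fit to the complete table is pulled away from them by the single discrepant cell. Comparing `expected_deviant` with the observed .86 shows how far that group departs from the main-effects pattern.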
In addition to the chi-square statistic, an examination of residuals for this model in Table 2 suggests the model fits the data quite well. The fit of the model suggests that for most of the population, a main effects logit model is appropriate and the expected probability of having a published number can be calculated easily from the parameter estimates for the model. These estimates are:
Age: Young -.323, Old .323
Education: 12 years or less -.197, 13 years or more .197
Marital Status: Unmarried -.695, Married .695
Race: Nonwhite -.600, White .600
Constant: .569
Thus the log of the odds of a woman having a published number if she were older, better educated, married, and white is
$.323 + .197 + .695 + .600 + .569 = 2.384$
The expected proportion published for those individuals under M9 is calculated as:
$e^{2.384}/(1 + e^{2.384}) = .92$
The expected proportions for all of the demographic combinations considered are listed in Table 2. The problem with the main effects model fit to the complete table is evident from an examination of these expected proportions. While the observed proportion published for younger, better educated, nonmarried white women was .86, the expected proportion under M9 based on the data in the incomplete table was .59 (log odds of $-.323 + .197 - .695 + .600 + .569 = .348$).
The structural zero approach thus retains the usefulness of the information available from the main effects model--the size of the main effects coefficients, for example--and provides insight into the origins of the lack of fit of the main effects model. These insights are lost when the researcher simply includes interactions in order to more accurately reproduce observed cell frequencies.
CONCLUSIONS AND COMMENTS
This paper has advocated the use of structural zeros in certain situations in which simple models do not adequately describe the relationships in a data set. The use of structural zeros was illustrated in the analysis of telephone publication status for a situation in which a main effects logit model did not fit a set of data.
For some situations this approach seems to be preferable to the data dependent inclusion of interactions for the purpose of achieving model fit. However, in situations in which there are a priori reasons for including the interactions, the researcher would probably not use the structural zero approach. The paper concludes with a few comments on the use of structural zeros in analyzing contingency tables.
(1) The early work on incomplete table analysis was begun by Goodman (1968, 1969). This early work involved fitting models of quasi-independence to social mobility tables and represented an attempt to account for the clustering on the main diagonal which occurs in these tables. The quasi-independence model applied in mobility analysis fits the model of independence to those cells not on the main diagonal of the table. Quasi-independence models have also been applied in situations in which researchers are interested in scaling response patterns. The quasi-independence model is equivalent to a restricted latent class model and quasi-independence can be included as one of many possible models tested in a latent class framework. Illustrations of this use of quasi-independence models can be found in Clogg and Sawyer (1981), Dillon, Madden, and Mulani (1983), and Goodman (1975, 1979b).
(2) Degrees-of-freedom calculations for incomplete table analysis can be quite complicated and are dependent on the model selected. In the simplest cases, the degrees-of-freedom will be the degrees-of-freedom for the model applied to the complete table minus the number of structural zeros. Fienberg (1980) lists a formula which is useful more generally. Degrees of freedom are calculated as the number of nonzero estimated cells in the table minus the number of estimated parameters. Table 2 contains 32 cells; however, since two cells are regarded as structural zeros, there are only 30 cells with nonzero estimated frequencies, i.e., estimation is based on information in only 30 cells.
There are 21 unique loglinear parameters implied by a main effects logit model applied to five dichotomous items. In M9, however, one of the parameters corresponding to the joint marginal of the predictors, EMRA, cannot be estimated because of the zero marginal entry, leaving 20 estimated parameters; thus there are 30 - 20 = 10 d.f. for M9.
(3) In general, the use of structural zeros is not the same as eliminating offending cases from the original data set, i.e., replacing frequencies in the offending cells with "sampling" zeros. Such a procedure would not, in general, guarantee that these cells would have estimated expected frequencies of zero. However, identical results to M9 could have been obtained by eliminating younger, better educated, nonmarried white women from the data set and then estimating the main effects model. For this example the results of the analysis with sampling zeros and structural zeros would be identical because of the assumption of a fixed margin for the joint predictor variable. This assumption guarantees zero expected frequencies for the cells in question, i.e., women who were younger, better educated, white, and not married.
(4) Although in this paper structural zeros were used to eliminate certain cells from the analysis of a table in which a single model was tested, it should be evident that in some cases it will be possible to fit more than one formal model to a table, using structural zeros to eliminate certain cells for the first model, then eliminating other cells in a second analysis with another model, and so on.
(5) Finally, for pedagogical purposes this paper has assumed dichotomous predictor variables within a specified model in analyzing telephone publication status.
The author does not advocate collapsing categories of variables indiscriminately; such a procedure can lead to the appearance (or disappearance) of interactions which do not exist (or exist) in the uncollapsed variables (see Goodman 1979a), and thus to erroneous conclusions about the relationships in the table. For discussions on collapsing categories in polytomous variables, the interested reader should consult Duncan (1975), Feick (1984), or Goodman (1981).
REFERENCES
Bishop, Yvonne M.M., Stephen E. Fienberg, and Paul W. Holland (1975), Discrete Multivariate Analysis, Cambridge, Massachusetts: The MIT Press.
Bock, R. Darrell and George Yates (1973), Multiqual: Log-Linear Analysis of Nominal or Ordinal Data by the Method of Maximum Likelihood, Chicago: National Educational Resources.
Clogg, Clifford C. and Darwin O. Sawyer (1981), "A Comparison of Alternative Models for Analyzing the Scalability of Response Patterns," in Sociological Methodology 1981, ed. S. Leinhardt, San Francisco: Jossey Bass.
Cox, D.R. (1970), The Analysis of Binary Data, New York, Methuen.
DeSarbo, Wayne S. and David K. Hildebrand (1980), "A Marketer's Guide to Log-Linear Models for Qualitative Data Analysis," Journal of Marketing, 44 (Summer), 40-51.
Dillon, William R., Thomas J. Madden, and Narendra Mulani (1983), "Scaling Models for Categorical Variables: An Application of Latent Structure Models," Journal of Consumer Research, 10 (September), 209-224.
Duncan, O.D. (1975), "Partitioning Polytomous Variables in Multiway Contingency Analysis," Social Science Research, 6 (December), 167-182.
Feick, Lawrence F. (1984), "Analyzing Marketing Research Data with Association Models," Journal of Marketing Research, 21 (November), forthcoming.
Fienberg, Stephen E. (1980), The Analysis of Cross-Classified Categorical Data, Cambridge, Massachusetts: The MIT Press.
Glasser, Gerald J. and Gale D. Metzger (1972), "Random Digit Dialing as a Method of Telephone Sampling," Journal of Marketing Research, 9 (February), 59-64.
Goodman, Leo A. (1968), "The Analysis of Cross Classified Data: Independence, Quasi-independence, and Interaction in Contingency Tables With or Without Missing Cells," Journal of the American Statistical Association, 63 (December), 1091-1131.
Goodman, Leo A. (1969), "How to Ransack Social Mobility Tables and Other Kinds of Cross-Classification Tables," American Journal of Sociology, 75 (July), 1-40.
Goodman, Leo A. (1970), "The Multivariate Analysis of Qualitative Data: Interactions Among Multiple Classifications," Journal of the American Statistical Association, 65 (March), 225-256.
Goodman, Leo A. (1975), "A New Model for Scaling Response Patterns: An Application of the Quasi-Independence Concept," Journal of the American Statistical Association, 70 (December), 755-768.
Goodman, Leo A. (1979a), "A Brief Guide to the Causal Analysis of Data from Surveys," American Journal of Sociology, 84 (March), 1078-1095.
Goodman, Leo A. (1979b), "The Analysis of Qualitative Variables Using More Parsimonious Quasi-Independence Models, Scaling Models, and Latent Structures," in Qualitative and Quantitative Social Research, eds. R.M. Merton, J.S. Coleman, and P.H. Rossi, New York: The Free Press.
Goodman, Leo A. (1981), "Criteria for Determining Whether Certain Categories in a Cross-Classification Table Should be Combined, With Special Reference to Occupational Categories in an Occupational Mobility Table," American Journal of Sociology, 87 (November), 612-650.
Green, Paul E. (1978), "An Aid/Logit Procedure For Analyzing Large Multiway Contingency Tables," Journal of Marketing Research, 15 (February), 132-136.
Groves, Robert M. (1978), "An Empirical Comparison of Two Telephone Sample Designs," Journal of Marketing Research, 15 (November), 622-631.
Groves, Robert M., and Robert L. Kahn (1979), Surveys by Telephone: A National Comparison with Personal Interviews, New York: Academic Press.
Haberman, Shelby J. (1979), Analysis of Qualitative Data, Volume 2: New Developments, New York: Academic Press.
Peterson, Robert A., Robert P. Leone, and Mohammad H. Sabertehrani (1981), "Investigating Income Refusals in a Telephone Survey by Means of Logit Analysis," in Advances in Consumer Research, 8, Kent B. Monroe, ed., Ann Arbor: Association for Consumer Research, 287-291.
Rich, Clyde L. (1977), "Is Random Digit Dialing Really Necessary?" Journal of Marketing Research, 14 (August), 300-305.
Tyebjee, Tyzoon T. (1979), "Telephone Survey Methods: The State of the Art," Journal of Marketing, 43 (Summer), 68-78.
Authors
Lawrence F. Feick, University of Pittsburgh
Volume
NA - Advances in Consumer Research Volume 12 | 1985