IRT-Based Item Level Analysis: An Additional Diagnostic Tool for Scale Purification

Michel Laroche, Concordia University
Chankon Kim, St. Mary’s University
Marc A. Tomiuk, Concordia University
ABSTRACT - The psychometric characteristics of items within a three-dimensional scale of acculturation were analysed with traditional methods and with Testgraf, a non-parametric Item Response Theory program. Testgraf emerged as a good diagnostic tool for assessing the quality of individual scale items. It made it easy to determine where a potential problem may lie within an item. Its usefulness makes it an additional procedure that may be used in conjunction with traditional methods of scale purification. It is argued that the inclusion of Testgraf in the initial steps of the current measure development paradigm will enhance the paradigm’s ability to generate measures of high quality.
[ to cite ]:
Michel Laroche, Chankon Kim, and Marc A. Tomiuk (1999), "IRT-Based Item Level Analysis: An Additional Diagnostic Tool for Scale Purification", in NA - Advances in Consumer Research Volume 26, eds. Eric J. Arnould and Linda M. Scott, Provo, UT: Association for Consumer Research, Pages: 141-149.

Advances in Consumer Research Volume 26, 1999      Pages 141-149

[The authors gratefully acknowledge the financial support of the FCAR, Quebec. They also extend many thanks to professor J. O. Ramsay of McGill University for his help in analyzing the data with Testgraf.]

INTRODUCTION

Adherence to traditional methods for scale purification assures researchers that a reasonably reliable and valid measure will emerge (Churchill, 1979; Gerbing & Anderson, 1988). However, these methods rest on Classical Test Theory, whose assumptions are often difficult to meet in practice. Moreover, traditional methods are inherently correlational and are therefore geared towards analysis at the composite level. An alternative nonlinear approach is based on Item Response Theory (IRT). Conceptually, IRT circumvents many of the problems (e.g., stringent assumptions) associated with Classical Test Theory. Applications involve an analysis at the item level rather than at the composite level. The purpose of this article is to demonstrate the usefulness of Testgraf (Ramsay, 1995a), a non-parametric IRT program, in the purification of interval scale items. This is done in conjunction with traditional methods.

TRADITIONAL ITEM AND SCALE PURIFICATION PROCEDURES

Prior to the use of a measure or scale of some construct, preliminary analysis of a composite measure usually involves attempts at identifying items which do not fare well in comparison to other items in the scale. Churchill (1979) has suggested a series of steps for identifying poor items and purifying a composite measure. From a methodological perspective, all of the statistical procedures suggested by Churchill (1979) are linear and most are correlational (i.e., item-total correlations, coefficient alpha (α), exploratory factor analysis). Gerbing and Anderson (1988) have extended Churchill’s (1979) measure purification paradigm by stressing the use of confirmatory factor analysis (CFA) in order to test the unidimensionality of composite measures. Bagozzi (1994:331) holds that the CFA model hypothesizes "that the variance of each measure loading on a factor is a linear function of the underlying theoretical variable plus error: $x_i = \lambda_i \xi + \delta_i$." Like exploratory factor analysis, this approach is similar to true-score theory which underlies Classical Test Theory (Bollen, 1989; Lord & Novick, 1968).

In fact, all of the traditional statistical procedures used in scale purification are rooted in Classical Test Theory (CTT). In general, this theory models responses to scale items as a function of some set of parameters that are specific to the given set of items and the given sample of respondents. In other words, item-level statistics, reliability, and factor analysis estimates will change from sample to sample and as items are added or deleted from scales (Hambleton, Swaminathan, & Rogers, 1991; Santor, Ramsay, & Zuroff, 1994:255). Moreover, the basic assumptions of Classical Test Theory are somewhat unrealistic. For instance, the assumption of parallel tests is quite difficult to meet in practice and is usually ignored in scale purification attempts (Allen & Yen, 1979). Moreover, Hambleton et al. (1991:4) argue that the various reliability coefficients proposed within the CTT framework "provide either lower bound estimates of reliability or reliability estimates with unknown biases." One such coefficient is Cronbach’s α (alpha). Its use becomes inappropriate when the test cannot be divided into parallel or tau-equivalent sub-tests (Allen & Yen, 1979; Bagozzi, 1994). Cronbach’s α is also not invariant to scale changes in the indicators (Greene, 1977). Finally, another potential problem with CTT is that it is test-oriented rather than item-oriented. It therefore does not provide pertinent information such as item-difficulty or item-discriminability levels (Santor et al., 1994).

CTT’s orientation toward analysis at the composite level has clearly made various forms of factor analysis quite prevalent in the assessment of composite scales. However, factor analysis is a topic which is itself hampered by often stringent assumptions and various limitations. For instance, maximum likelihood (ML) estimation is predominant because of the many advantages it offers over distribution-free methods for parameter estimation (Anderson & Gerbing, 1988; Bagozzi, 1994; Long, 1976). However, various shortcomings emerge. In particular, ML estimation requires relatively large sample sizes (Anderson & Gerbing, 1988). It also imposes the strong assumptions of linearity and normality, which are sometimes violated (Long, 1976; Takane & de Leeuw, 1987). Nevertheless, ML estimation has tended to display robustness under conditions of moderate violations of multivariate normality and sample size variations (see Boomsma, 1983; Browne, 1984).

ITEM RESPONSE THEORY

Item Response Theory (IRT) has traditionally been presented in achievement testing contexts. However, it has also been extended to the analysis of scale items such as those which are commonly seen in marketing research (e.g., Likert-type). Whereas the traditional procedures discussed above account for the correlation (i.e., reliability estimates and item-total correlation) or covariation (i.e., CFA) between scale items, IRT models account for responses to items by examinees or respondents (Reise, Widaman, & Pugh, 1993). IRT does this in a nonlinear and probabilistic fashion (Ramsay, 1995b). Two basic postulates drive IRT: (a) the performance of a respondent on a scale item can be explained by a factor called 'latent trait’ or 'ability’ (θ) and (b) the relationship between one’s particular item response and one’s standing on the unobservable trait measured by the entire test or composite can be represented by a monotonically increasing function called an Item or Option Characteristic Curve (OCC) (Ramsay, 1995b). Typical OCCs appear in various texts on IRT (see Camilli & Shepard, 1994; Hambleton & Swaminathan, 1985; Hambleton et al., 1991; Lord, 1980). They are usually generated with a variety of parametric models (see Hambleton et al., 1991).
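
As a point of reference for the parametric models mentioned above, one widely used form for a dichotomous item is the two-parameter logistic model. It is shown only as an illustration: the symbols $a_j$ (item discrimination) and $b_j$ (item difficulty) are standard textbook notation and are not taken from this paper.

```latex
P_j(\theta) = \frac{1}{1 + \exp\left[-a_j\,(\theta - b_j)\right]}, \qquad a_j > 0
```

Under this form the curve increases monotonically in θ, passes through .5 at θ = $b_j$, and rises more steeply the larger $a_j$ is, which is the sense in which slope reflects item discriminability.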

The notion that the probability of a correct response is independent of a particular examinee or the degree of item difficulty is fundamental to IRT. This probability is rather a function of θ, the latent trait or overall ability (Osterlind, 1983). In other words, item statistics generated by IRT models are presented strictly in terms of the relationship between the item and the underlying latent trait (McKinley & Mills, 1989). This characteristic is what renders IRT a powerful tool for the investigation of test and scale items. It is also that which circumvents many of the problems inherent in procedures rooted in Classical Test Theory. The notable consequence of this separation of items from persons is referred to as 'parameter invariance’ (Lord, 1980).

Estimated IRT models denote particular distributions of responses to items conditional on levels of θ (Reise et al., 1993). Traditionally, parametric IRT models have been employed to generate item characteristic curves. Despite the wide diffusion of parametric models, they have involved many shortcomings. In particular, most parametric models were designed for use with dichotomous response options in achievement testing (Santor et al., 1994). Typical illustrations of these models show item characteristic curves for either the 'correct’ or the 'incorrect’ response option. When polychotomous or interval scale items are analysed, the restrictions and limitations of parametric models become apparent for they usually require that responses to scale items be dichotomized (e.g., Mislevy & Bock, 1982). Moreover, these models do not permit the examination of multiple response options simultaneously (Santor et al., 1994). Another problem in using common parametric models has been estimation difficulties because of the great number of item and respondent parameters to be estimated (Ramsay, 1995a). Finally, these models usually require large sample sizes (Hambleton et al., 1991).

More recent IRT model formulations have emerged. Testgraf (Ramsay, 1995a) represents a non-parametric approach to IRT modelling. It allows for the examination of polychotomous scale items without dichotomization and the consequent loss of information that this procedure involves. Moreover, the preservation of the original response options allows for the plotting of option characteristic curves which simultaneously depict the probability of endorsing one of many response options as a function of the latent trait. Testgraf estimation is based on a type of local averaging called kernel smoothing (Ramsay, 1991, 1995a, 1995b). Its output is discussed in detail in the results section. A good and short overview of Testgraf appears in Santor et al. (1994). The program runs under MS-DOS on IBM-compatible machines and is available at no cost from professor J. O. Ramsay (e-mail: ramsay@psych.mcgill.ca).
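
The kernel smoothing idea can be sketched in a few lines of Python. This is a minimal illustration and not Testgraf itself: it assumes respondents are ranked by total score, ranks are replaced by standard normal quantiles to serve as trait values, and the indicator of each response option is then averaged locally with Gaussian kernel weights. The function names, the tie-breaking rule, and the choice of bandwidth are all assumptions made for this example.

```python
import numpy as np
from statistics import NormalDist


def normal_quantile_scores(total_scores):
    """Map each respondent's total score to a standard normal quantile.

    Assumption for this sketch: ties in total score are broken by order
    of appearance (stable sort) rather than averaged.
    """
    total_scores = np.asarray(total_scores, dtype=float)
    n = total_scores.size
    # argsort of argsort yields 0-based ranks; add 1 for ranks 1..n
    ranks = np.argsort(np.argsort(total_scores, kind="stable"), kind="stable") + 1
    nd = NormalDist()
    # r / (n + 1) lies strictly between 0 and 1, so inv_cdf is defined
    return np.array([nd.inv_cdf(r / (n + 1)) for r in ranks])


def option_characteristic_curves(item_responses, theta, grid, options, bandwidth):
    """Kernel-smoothed estimate of P(option m | theta) at each grid point.

    Returns an array of shape (len(grid), len(options)); each row sums
    to 1 when every observed response is one of `options`.
    """
    item_responses = np.asarray(item_responses)
    theta = np.asarray(theta, dtype=float)
    grid = np.asarray(grid, dtype=float)
    options = np.asarray(options)
    # Gaussian kernel weights: rows are grid points, columns are respondents
    z = (grid[:, None] - theta[None, :]) / bandwidth
    weights = np.exp(-0.5 * z ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    # Indicator matrix: 1 where respondent i chose option m, else 0
    indicators = (item_responses[:, None] == options[None, :]).astype(float)
    # Locally weighted average of the indicators
    return weights @ indicators
```

Plotting each column of the returned array against the grid would give one curve per response option for the item. A smaller bandwidth follows the data more closely and a larger one gives smoother curves, so the value is a tuning choice.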

GOALS OF THE STUDY

This study will primarily serve to show the usefulness of Testgraf, and consequently that of IRT, in the measure purification process. First, we will provide a traditional analysis of our scale items based on suggestions in Churchill (1979) and on studies in marketing which have implemented CFA in their measure development process. Accordingly, we will base our item and scale analyses on item-total correlations, coefficient alpha, exploratory and confirmatory factor analysis. Next, Testgraf output will be presented for the same set of items. Item discriminability will emerge as an important criterion in evaluating items. It is expected that Testgraf may point out certain anomalies in our items that the traditional procedures were unable to detect.

METHOD

Measures

A measure of acculturation was developed for use with Italian-Canadians (Tomiuk, 1993). Prior to the generation of items, acculturation was broadly conceptualized as the acquisition of host culture traits (Keefe, 1980; Lee et al., 1991). Moreover, it was presented as multidimensional. An initial data collection led to a preliminary purification based on the suggestions in Churchill (1979). The resulting measure and subsequent data collection are discussed in this paper. They represent a second purification study. The measure as it stands now taps four dimensions of acculturation. For the illustrative purposes of this study we focussed our attention on only three dimensions: (a) English Language Use, (b) English-Canadian Social Interaction and Participation, and (c) English-Canadian Identification and Pride. The three-dimensional measure appears in Appendix I. Twenty-five Likert-type items make up the instrument. Respondents were asked to indicate how strongly they personally agree or disagree with each statement by circling the appropriate number on a nine-point scale. The anchor points were '1’ (Disagree Strongly) and '9’ (Agree Strongly). An Italian language version of the original English version was also developed. Please note that a full analysis of the four-dimensional measure is available in working paper form by contacting the main author (e-mail: laroche@vax2.concordia.ca).

Subjects

The questionnaire was administered via mail survey to respondents of Italian origin within a greater metropolitan area in Eastern Canada. An area sampling procedure was used. The respondents were given a choice between an English or Italian version of the questionnaire. A total of 312 usable questionnaires was collected. About half (50.6%) of the respondents were female and a substantial portion (43.1%) was in the 30-49 years of age category. The majority (97.7%) was married or living with someone. When asked about their place of birth, 24.7% reported that it was in Canada and 70.2% claimed that it was in Italy. Almost all (98%) reported having Canadian citizenship.

RESULTS

Traditional Item and Scale Analyses

For each of the three dimensions/composites of the measure, coefficient alpha and item-total correlations were computed. The nine-item composite measure of English Language Use had an alpha estimate of .951. All item-total correlations were above .7 with the exception of that of item 4, which was the lowest (.6477). The six items of the English-Canadian Social Interaction and Participation dimension yielded an alpha estimate of .949. All item-total correlations were above .82. The ten-item English-Canadian Identification and Pride dimension resulted in an alpha estimate of .929 and none of its items had an item-total correlation below .5. Nevertheless, relatively lower item-total correlations appeared for item 1 (.531), item 2 (.689), item 6 (.582), and item 10 (.623) of this dimension. In sum, the coefficient alpha estimates were high and no item-total correlation was below .5. Accordingly, no items were deleted at this stage.
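
For readers who wish to reproduce this kind of screening, the two statistics can be computed as in the following sketch. It is an illustration only: the paper does not state which software was used or whether the item-total correlations were corrected, so the corrected form (each item correlated with the sum of the remaining items) is an assumption here, as are the function names.

```python
import numpy as np


def cronbach_alpha(X):
    """Coefficient alpha for a respondents-by-items score matrix."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    # Sum of the individual item variances
    item_variance_sum = X.var(axis=0, ddof=1).sum()
    # Variance of the composite (row total) score
    total_variance = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variance_sum / total_variance)


def corrected_item_total(X):
    """Correlation of each item with the sum of the other items."""
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    return np.array([
        np.corrcoef(X[:, j], total - X[:, j])[0, 1]
        for j in range(X.shape[1])
    ])
```

Applied to the columns of one dimension at a time, the second function returns one correlation per item, which can then be compared against a cutoff such as the .5 value used above.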

Next, exploratory factor analysis (ML estimation with oblique rotation) showed that the measure was indeed comprised of three clear factors which accounted for 68.1% of the variation in the data. An examination of the factor loadings indicated that the items generally loaded on their intended factors. Nevertheless, the following items exhibited loadings below .7: item 3 (.571) and item 4 (.655) of Factor 1 (English Language Use); item 1 (-.655) and item 2 (-.686) of Factor 3 (E-C Social Interaction and Participation); and item 1 (.478), item 2 (.538), item 6 (.449), and item 10 (.507) of Factor 2 (E-C Identification and Pride). These items were consequently removed from the instrument and were therefore not included in subsequent analysis (Netemeyer, Burton, & Lichtenstein, 1995).

TABLE 1

STANDARDIZED ESTIMATES OF CONFIRMATORY FACTOR MODEL (ML)

The remaining items were submitted to confirmatory factor analysis. The results appear in Table 1. Overall fit of the model was poor according to the χ², which was estimated at 451.4 with 116 degrees of freedom (p=0.0). However, the χ² is not an appropriate measure of fit when sample sizes are large, as in this case (Bagozzi & Yi, 1988). On the other hand, the Normed Fit Index (NFI) and the Comparative Fit Index (CFI) are known to be less sensitive to sample size effects, and estimates greater than .9 are taken as indicative of a meaningful model (Bagozzi & Yi, 1988; Bentler, 1990). These indices equalled .92 and .94, respectively. The model therefore exhibited acceptable levels of overall fit. All factor loadings (λ’s) were significant (|t-value| > 2) and above .7. This indicated acceptable levels of reliability (Bagozzi, 1994; Bollen, 1989; Netemeyer et al., 1995). Nevertheless, some items exhibited relatively higher measurement error variances ($\theta_\delta$’s) and therefore relatively lower reliability. This was especially apparent for item 1 (.45) of the English Language Use factor and for item 3 (.36) of the E-C Identification and Pride factor. Moreover, some items had many relatively large residuals associated with them. Most apparent were item 3 and item 9 of the E-C Identification and Pride factor. Finally, some items displayed somewhat of a need to load on more than one factor. This was particularly evident for item 2 of the English Language Use factor, item 3 of the E-C Social Interaction and Participation factor, and item 4 of the E-C Identification and Pride factor. Nevertheless, the modification indices for these items were all below 10. In sum, no item could be identified which frequently gave rise to large normalized residuals but which was also associated with one or two relatively high modification indices and/or relatively high error variance (i.e., unreliability). Therefore, we did not eliminate any additional items from the scale (see Anderson & Gerbing, 1988; Netemeyer et al., 1995; Kohli, Jaworski, & Kumar, 1993). The items eliminated in the previous step are highlighted in Appendix I.
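
The link between the reported error variances and the loadings can be made explicit with a short worked calculation. It assumes that the estimates are completely standardized and that each item loads on a single factor, so that the squared loading and the error variance of an item sum to one; the reliability symbol $\rho_i$ is introduced here for illustration.

```latex
\lambda_i^2 + \theta_{\delta_i} = 1
\quad\Longrightarrow\quad
\rho_i = \lambda_i^2 = 1 - \theta_{\delta_i}

\text{Item 1, English Language Use: } \theta_{\delta} = .45
\;\Rightarrow\; \lambda^2 = .55,\ \lambda = \sqrt{.55} \approx .74

\text{Item 3, E-C Identification and Pride: } \theta_{\delta} = .36
\;\Rightarrow\; \lambda^2 = .64,\ \lambda = .80
```

Under these assumptions both items remain above the .7 loading threshold, which is consistent with the decision to retain them despite their relatively higher error variances.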

TestGraf Analyses

The unidimensionality of θ is a main assumption which underlies most IRT models (Ramsay, 1995b; Reise et al., 1993). A principal components factor analysis of each dimension revealed that in all three cases a single factor dominated the data (i.e., first extracted factors accounted for more than 50% of the total variance in data).
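
A check of this kind can be sketched as follows. The example assumes that the principal components are extracted from the item correlation matrix of one dimension, which the paper does not specify; the function name is hypothetical.

```python
import numpy as np


def first_component_share(X):
    """Share of total variance carried by the first principal component.

    X is a respondents-by-items score matrix for a single dimension.
    Components are taken from the item correlation matrix (an assumption
    of this sketch).
    """
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    eigenvalues = np.linalg.eigvalsh(R)
    return eigenvalues[-1] / eigenvalues.sum()
```

A returned value above .5 for a dimension would correspond to the criterion described above, namely a first factor accounting for more than 50% of the total variance.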

Testgraf output appears in Appendix II. Analysis of Testgraf results is largely based on the visual inspection of two types of graphical output. Option characteristic curves (OCCs) for each item are represented by the upper of the two plots generated for each item. An OCC is produced for each option within each item. Nine OCCs were therefore generated for each of the items and are presented simultaneously. $P_{jm}(\theta)$ appears on the vertical axis of the plot. It represents the probability of endorsing option m of item j as a function of θ, the latent trait (e.g., English Language Use). A function of θ appears on the horizontal axis. More than one scaling option for θ is available. Standard normal proficiency values were chosen because Testgraf indicated that the default option, expected total score, was not a strictly increasing function of the normal quantiles used to estimate $P_{jm}(\theta)$. For a typically 'good’ item, the probability of choosing option '9’ (Agree Strongly) will increase as θ increases whereas the probability of choosing option '1’ (Disagree Strongly) will decrease as θ increases. Additionally, intermediate response options ('2’ to '8’) will also show increases followed by decreases over ranges of θ. Another type of output produced by Testgraf is one which displays expected item score on the vertical axis rather than $P_{jm}(\theta)$. This is represented by the lower of the two plots generated for each item. Expected item score plots constitute a summary of the information contained in option characteristic curves for a particular item. Santor et al. (1994:257) state that "the rate of increase in expected item score signifies how effective or sensitive an item is to changes" in the latent trait. A typically 'good’ item will clearly discriminate between respondents endowed with different levels of the latent trait. Such an item will therefore display expected scores that increase somewhat rapidly as a function of θ (Ramsay, 1995a).
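
The sense in which the expected item score summarizes the option characteristic curves can be written out directly. The expression below assumes that each option is weighted by its numeric label (1 to 9 on the present scale); the notation $E_j(\theta)$ is introduced here for illustration and does not appear in the Testgraf documentation cited above.

```latex
E_j(\theta) = \sum_{m=1}^{9} m \, P_{jm}(\theta),
\qquad \text{where} \quad \sum_{m=1}^{9} P_{jm}(\theta) = 1 .
```

An item whose high-numbered options gain probability quickly as θ increases will therefore show a steep expected score curve, whereas an item dominated by a single option over a range of θ will show a flat segment over that range.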

The nine items of Dimension 1 (English Language Use) exhibited slight to moderate anomalies. In many cases, the probabilities of endorsing options '1’ and '9’ dominated the entire range of the latent trait. For instance, the OCCs for item 4 clearly showed that the probability of endorsing these options was higher than for intermediate options. Moreover, the probability of choosing option '9’ (Agree Strongly) was highest among the response options from below the mid level of θ to its highest level. With respect to items 1, 2, 3 and 8, the converse became apparent: the probability of endorsing option '1’ (Disagree Strongly) was greater than that of any intermediate response option over more than half of the entire range of θ. On the other hand, apparently 'better’ items were items 5, 6, 7, and 9. These items tended to show that respondents who displayed less of the latent trait tended to choose options '1’ through '4.’ The middle range of the latent trait was more or less clearly associated with option '5.’ As values of the latent trait scale increased, respondents were clearly more likely to choose higher level options. Expected item score plots usually showed average item score climbing rapidly as levels of the latent trait increased. The slope of the expected item score plot for item 3 was somewhat flatter than that of the other items. A break appeared in the expected score plots of items 4 and 8. The range of the latent trait over which this occurred for item 4 coincided with the rather lengthy range dominated by option '9’ in the corresponding option characteristic curve plot for the item. For item 8, the break represented the lower probabilities of endorsing options '8’ and '9’ at the highest levels of the latent trait. In sum, items 3, 4 and 8 appeared to be somewhat dysfunctional. They tended not to discriminate between respondents endowed with varying levels of the latent trait over particular ranges of the trait.

The six items of Dimension 2 (E-C Social Interaction and Participation) had OCCs which tended to be better than those of the previous set of items because of the less apparent dominance of options '1’ and '9’ over entire ranges of θ. However, the OCCs were concentrated and intersected at higher levels of the latent trait or to the right of the midpoint of the latent trait scale. Moreover, their expected item score plots all indicated that the items discriminated between respondents with levels of the latent trait above -1.5. Discriminability was not apparent in the lower ranges of the trait (i.e., from -2.5 to -1.5) for all six items.

OCCs and expected item score plots for the 10 items of Dimension 3 (E-C Identification and Pride) showed that this dimension contained some particularly 'good’ items which included items 3, 4, 5, and 9. The OCCs for item 9 represent an excellent item in terms of IRT criteria. Various anomalies were evident in the remaining items. The OCCs of items 2, 6, and 10 showed that response option '1’ dominated substantial ranges of the latent trait for each item. Those of items 1, 2, and 10 indicated that response option '9’ had a probability of about .6 or less of being endorsed at high levels of the trait. The expected score plots pointed to the relative lack of discriminability of items 1, 6, and 10. Their curves were all relatively flatter and those of items 1 and 6 showed very visible breaks in their monotonic progressions. Items 1, 6, and 10 were of the poorest quality in the ten item composite. In summary, inspection of Testgraf output indicated that items 3, 4, and 8 of Dimension 1 and items 1, 6, and 10 of Dimension 3 were perhaps good candidates for deletion. These six items are highlighted in Appendix I.

DISCUSSION AND CONCLUSION

Two different methodologies were used in order to determine the quality of 25 items designed to assess three dimensions of acculturation. These items appear in Appendix I. When highlighted, the particular item was designated for deletion by one or both methodologies. There is an apparent overlap between the ability of the traditional item purification paradigm and that of the IRT-based approach in identifying anomalous items. However, for certain items there is not. It therefore becomes evident that although the two methodologies provide different levels of analysis and criteria for assessing items, they are both adept at detecting particular anomalies. Traditional approaches focus on the entire scale and on the performance of individual items vis-a-vis the entire composite. They revolve around the notions of internal and external consistency (Churchill, 1979; Gerbing & Anderson, 1988). Traditional cutoff values for various estimates provide clear ways of assessing the quality of items.

On the other hand, IRT analysis is performed at the item level. Item discriminability (i.e., slope of curve) is perhaps the most important criterion for determining the contribution of an item in measuring a latent trait. Other criteria are the smoothness (monotonic progression) of expected score plots and the relative positioning of OCCs (e.g., does option 1 dominate the entire range of theta, the latent trait?). In some cases, it was clear that an item was anomalous and that it should perhaps be deleted from the scale. In others, judgment needed to be exercised because we ourselves were unsure of the quality of the item and because the use of measurement scales formed with items which display different ranges of effectiveness is quite consistent with test construction theory (Nunnally, 1978). Moreover, the IRT-based approach emerged in some cases as a good procedure for pinpointing problems with items that the traditional approach was unable to detect. This was apparent for items of Dimension 2 (E-C Social Interaction and Participation) which fared well in terms of traditional criteria but did not discriminate well in the lowest regions of the latent trait. One obvious limitation of Testgraf is that there are no pre-assigned cut-off values for estimates by which one can objectively judge the quality of items. This remains interpretational and in some instances quite subjective. Another limitation is that expected total scores are iteratively derived from the items we wish to examine (see Santor et al., 1994).

To conclude, this paper does not dispute the value of the traditional item purification paradigm exemplified in the work of many researchers. It only suggests an additional and alternative nonlinear method which may enhance the basic paradigm’s ability to generate measures of high quality. Both approaches clearly provide useful diagnostics. They may be used in conjunction with one another in order to assess and purify measures of latent constructs. Nevertheless, we suggest that an IRT-based analysis precede traditional methods. This would quickly identify problematic items in terms of discriminability. Clearly anomalous items would then be deleted.

APPENDIX I

APPENDIX II

TESTGRAF OUTPUT FOR 25 ITEMS SPANNING THREE DIMENSIONS OF ACCULTURATION

REFERENCES

Allen, M. J., & Yen, W. M. (1979). Introduction to Measurement Theory. Monterey, CA: Brooks/Cole Publishing Company.

Anderson, J. C., & Gerbing, D. W. (1988). Structural Equation Modeling In Practice: A Recommended Two-Step Approach. Psychological Bulletin, 103, 411-423.

Bagozzi, R. P. (1994). Structural Equation Models in Marketing Research: Basic Principles. In R. P. Bagozzi (Ed.), Principles of Marketing Research (pp. 317-385). Oxford, England: Blackwell Publishers.

Bagozzi, R. P., & Yi, Y. (1988). On the Evaluation of Structural Equation Models. Journal of the Academy of Marketing Science, 16, 74-94.

Bentler, P. M. (1990). Comparative Fit Indices in Structural Models. Psychological Bulletin, 107, 238-46.

Bollen, K. A. (1989). Structural Equations with Latent Variables. New York: Wiley.

Boomsma, A. (1983). On the Robustness of LISREL (Maximum Likelihood Estimation) Against Small Sample Size and Non-Normality. Unpublished Doctoral Dissertation. University of Groningen, Groningen.

Browne, M. W. (1984). Asymptotically Distribution-Free Methods for the Analysis of Covariance Structures. British Journal of Mathematical and Statistical Psychology, 37, 62-83.

Camilli, G., & Shepard, L. A. (1994). Methods for Identifying Biased Test Items. Thousand Oaks, CA: Sage.

Churchill, G. A., Jr. (1979). A Paradigm for Developing Better Measures of Marketing Constructs. Journal of Marketing Research, 16, 64-73.

Gerbing, D. W., & Anderson, J. C. (1987). Improper Solutions in the Analysis of Covariance Structures and A Comparison of Alternate Respecifications. Psychometrika, 52, 99-111.

Gerbing, D. W., & Anderson, J. C. (1988). An Updated Paradigm for Scale Development Incorporating Unidimensionality and Its Assessment. Journal of Marketing Research, 25, 186-92.

Greene, V. L. (1977). A Note on Theta Reliability and Metric Invariance. Sociological Methods and Research, 6(1), 123-128.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Publications.

Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory. Boston, MA: Kluwer-Nijhoff Publishing.

Keefe, S. E. (1980). Acculturation and the Extended Family Among Urban Mexican Americans. In A. M. Padilla (Ed.), Acculturation: Theory, Models, and Some New Findings (pp. 85-110). AAAS Selected Symposium 39. Boulder, Colorado: Westview Press.

Kohli, A. K., Jaworski, B. J., & Kumar, A. (1993). MARKOR: A Measure of Market Orientation. Journal of Marketing Research, 30, 467-77.

Long, S. (1976). Estimation and Hypothesis Testing in Linear Models Containing Measurement Error: A Review of Joreskog’s Model for the Analysis of Covariance Structures. Sociological Methods and Research, 5(2), 157-206.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

McKinley, R. L., & Mills, C. N. (1989). Item Response Theory: Advances in Achievement and Attitude Measurement. Advances in Social Science Methodology, 1, 71-135.

Mislevy, R. J., & Bock, R. D. (1982). BILOG: Item Analysis and Test Scoring with Binary Logistic Models [Computer Program]. Mooresville, IN: Scientific Software.

Netemeyer, R. G., Burton, S., & Lichtenstein, D. R. (1995). Trait Aspects of Vanity: Measurement and Relevance to Consumer Behavior. Journal of Consumer Research, 21, 612-626.

Nunnally, J. C. (1978). Psychometric Theory, Second Edition. New York: McGraw-Hill.

Osterlind, S. J. (1983). Test Item Bias. Beverly Hills, CA: Sage Publications.

Ramsay, J. O. (1991). Kernel Smoothing Approaches to Nonparametric Item Characteristic Curve Estimation. Psychometrika, 56(4), 611-630.

Ramsay, J. O. (1995a). TestGraf: A Program for the Graphical Analysis of Multiple Choice Test and Questionnaire Data. Montreal, Canada: McGill University.

Ramsay, J. O. (1995b). Some Notes on the Statistical Analysis of Tests. Montreal, Canada: McGill University.

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory Factor Analysis and Item Response Theory: Two Approaches for Exploring Measurement Invariance. Psychological Bulletin, 114(3), 552-566.

Santor, D. A., Ramsay, J. O., & Zuroff, D. C. (1994). Nonparametric Item Analyses of the Beck Depression Inventory: Evaluating Gender Item Bias and Response Option Weights. Psychological Assessment, 6, 255-270.

Tomiuk, M. A. (1993). The Development and Content Validation of a Preliminary Multidimensional and Multicultural Measure of Culture Change for Italian-Canadians. Unpublished M. Sc. Thesis. Faculty of Commerce and Administration, Concordia University, Montreal, Canada.

Takane, Y., & de Leeuw, J. (1987). On the Relationship Between Item Response Theory and Factor Analysis of Discretized Variables. Psychometrika, 52, 393-408.
