Advances in Consumer Research Volume 25, 1998      Pages 240-245

TRANSLATION FIDELITY: AN IRT ANALYSIS OF LIKERT-TYPE SCALE ITEMS FROM A CULTURE CHANGE MEASURE FOR ITALIAN-CANADIANS

Michel Laroche, Concordia University

Chankon Kim, St Mary’s University

Marc A. Tomiuk, Concordia University

[The authors gratefully acknowledge the financial support of the Fonds FCAR and the assistance of James O. Ramsay for giving us access to his TestGraf software.]

ABSTRACT -

Most prior attempts at assessing translation fidelity with item response theory have relied on parametric estimation methods. Such approaches are generally associated with various interpretational and computational limitations. The program TestGraf, in contrast, relies on a non-parametric estimation method that offers several advantages. For instance, TestGraf is faster than parametric estimation procedures because it does not involve estimating a large number of parameters. Furthermore, it produces graphical output that is simple to interpret and therefore provides enhanced diagnostic capability for detecting problematic items. Moreover, TestGraf does not require that Likert-type scale item responses be dichotomized, as many of its precursors do (e.g., BILOG). English and Italian versions of eight Likert-type scale items were submitted to TestGraf analysis. These items spanned one dimension of a multidimensional measure of culture change for Italian-Canadians. Visual inspection of the graphical output produced by TestGraf readily revealed that the English-language items and their Italian translations displayed pervasive differential item functioning. In particular, plots consistently showed that pairs of Option Characteristic Curves did not coincide for most of the response options to the eight items. The lack of cross-cultural measurement equivalence was also reinforced by a numerical index computed by TestGraf for each response option to each item. Differential item functioning was attributed to the informality of the translation process. A formal back-translation procedure was suggested as a way of potentially eliminating this problem.

INTRODUCTION

It is common practice in investigations of cross-cultural adaptation to offer respondents a choice between a questionnaire in the language of the dominant group and a corresponding translation in their mother tongue. In fact, most empirically-based cross-cultural research relies heavily on measures developed in one language and later translated into another. Inferences regarding the behaviour of variables are meaningful only if the translation fidelity of the instrument has been established (Bontempo, 1993). Translation fidelity is therefore an issue of paramount importance to cross-cultural research. Nevertheless, translations remain "a major stumbling block in the path of rigorous, cross-cultural research" (Candell and Hulin, 1987, p.417).

The traditional approach to translating instruments used in cross-cultural research is known as back-translation. This iterative process requires that the instrument be first prepared in the source language (e.g., English) and then translated into the target language (e.g., Italian). The instrument is subsequently translated back into the source language. The two source-language versions are then compared and any discrepancies between them are resolved. However, back-translation alone has been found to be insufficient to ensure semantic and psychometric equivalence (Bontempo, 1993; Hulin, 1987). Moreover, analyzing translated items with traditional methods based on classical test theory entails a variety of shortcomings. These procedures rest on assumptions that are stringent and unrealistic, and that therefore cannot be adequately met or tested (Malpass, 1977); in fact, they generally stand violated (Allen and Yen, 1979).

The purpose of Item Response Theory, as with any test theory, is to provide a basis for making estimates or inferences about latent traits measured by a test or scale. Generally, IRT has been presented as a means of circumventing the problems associated with classical reliability theory (Allen and Yen, 1979; Hambleton and Swaminathan, 1985; Lord, 1980). More recently, IRT has also been presented as a form of analysis which is very useful in assessing equivalence between scale versions or translations (Candell and Hulin, 1987). Psychometric equivalence between a source and a target language translation is determined by equivalence of response probabilities to source and target language items (Hulin, 1987). As a general rule, two versions of a scale are deemed equivalent only if all items in the scale are equivalent (Candell and Hulin, 1987).

Perhaps the greatest deterrent to using common parametric IRT models has been the difficulty encountered in estimating them. The great number of item and respondent parameters to be estimated remains a definite problem. Moreover, programs for fitting such models are usually "slow, complex, and loaded with heuristic devices for preventing failure" (Ramsay, 1995a, p.93). The program TestGraf, on the other hand, relies on a non-parametric estimation method. Specifically, its kernel smoothing procedure allows for quick estimation and produces output that describes respondent data with precision (Ramsay, 1995a). Moreover, TestGraf is user-friendly and requires only that one be acquainted with the rudiments of IRT. Furthermore, the program produces graphical output that is simple to interpret and therefore provides enhanced diagnostic capability for detecting problematic items. In fact, the interpretation of its output relies mainly on the visual inspection of graphs rather than on tabulated parameter estimates. Finally, TestGraf does not require that Likert-type scale item responses be dichotomized as in many of its precursors such as BILOG (see Ellis, Becker, and Kimmel, 1993). In other words, the integrity of the original scaling is maintained.
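
To make the idea of a kernel-smoothed response curve concrete, the sketch below estimates the probability of endorsing a given option as a smooth function of a provisional scale score. It is only an illustration of Nadaraya-Watson kernel smoothing under assumed variable names, a Gaussian kernel, and an arbitrary bandwidth; it is not a reproduction of TestGraf's own algorithm, which is documented in Ramsay (1995a).

    import numpy as np

    def option_curve(scores, chose_option, grid, bandwidth=0.3):
        # Kernel-smoothed estimate of P(option | score): at each grid point,
        # compute a weighted proportion of respondents who chose the option,
        # with weights decaying as respondents' scores move away from that point.
        scores = np.asarray(scores, dtype=float)
        chose_option = np.asarray(chose_option, dtype=float)
        curve = np.empty(len(grid))
        for j, t in enumerate(grid):
            w = np.exp(-0.5 * ((scores - t) / bandwidth) ** 2)  # Gaussian kernel weights
            curve[j] = np.sum(w * chose_option) / np.sum(w)     # weighted proportion
        return curve

Because no item or person parameters are fitted, such a smoother runs quickly and imposes no particular parametric shape on the curve, which is the general appeal of the non-parametric approach described above.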

This study aims to illustrate how TestGraf may be used to detect differential item functioning between two language versions of a measure. For this purpose, English and Italian versions of eight Likert-type scale items, each with nine response options, were submitted to TestGraf for analysis. The items were part of a preliminary version of a multidimensional culture change measure developed for Italian-Canadians. Output from TestGraf is presented and interpreted. It is concluded that the measure displays considerable differential item functioning and that TestGraf appears to be a simple and effective method for assessing the merit of translations of polychotomous scale items.

CLASSICAL TEST THEORY

In Classical Test Theory, the responses to scale items by a given respondent are modelled as a function of some set of parameters that are specific to the given set of items and respondents. Therefore, if different respondents are used, item parameter values will change. For instance, the values of coefficient alpha and related item-to-total statistics will tend to change from sample to sample. An additional problem arises when different stimuli are used at different times to measure the same construct, as in the case of a source-language item and a poor target-language translation. Here, respondent scores tend to change. Accordingly, Hambleton, Swaminathan, and Rogers (1991) propose that the most important shortcoming of Classical Test Theory is "that examinee characteristics and test characteristics cannot be separated: each can be interpreted only in the context of the other" (p. 2). Classical Test Theory is plagued by four other major problems. First, its most basic assumptions are almost impossible to meet in practice (Ramsay, 1995b). Secondly, its definition of reliability carries with it the assumption of parallel tests (Allen and Yen, 1979), which is once again quite difficult to meet in practice. Additionally, Hambleton et al. (1991) argue that the various reliability coefficients proposed within the Classical Test Theory framework "provide either lower bound estimates of reliability or reliability estimates with unknown biases" (p. 4). One such coefficient is α (alpha). As an internal consistency estimate, it allows the calculation of reliability when a test is given only once. Nevertheless, it is not appropriate when the test cannot be divided into parallel or essentially tau-equivalent parts (Allen and Yen, 1979, p.80). A third problem emerges with the standard error of measurement, which is a function of test score reliability and variance. This conceptual converse of reliability in Classical Test Theory is wrongly assumed to be the same for all examinees (Hambleton et al., 1991). Accordingly, the Classical Test Theory assumption of equal errors of measurement for all examinees becomes implausible. A fourth and final problem with Classical Test Theory is that it is test-oriented rather than item-oriented (Hambleton et al., 1991, p. 4). It therefore does not provide pertinent information such as item-difficulty or item-discriminability levels.
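
For reference, coefficient alpha can be computed directly from a respondents-by-items data matrix using the standard formula α = (k/(k-1))(1 - Σ item variances / variance of the total score). The sketch below is a minimal illustration of that formula; the function name and the assumption of complete data are ours.

    import numpy as np

    def cronbach_alpha(X):
        # X: respondents-by-items matrix of Likert responses (no missing data assumed)
        X = np.asarray(X, dtype=float)
        k = X.shape[1]
        item_vars = X.var(axis=0, ddof=1)        # variance of each item
        total_var = X.sum(axis=1).var(ddof=1)    # variance of the summed scale score
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

As the passage above notes, this statistic is sample-dependent and is only a lower bound on reliability unless the items form parallel or essentially tau-equivalent parts.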

ITEM RESPONSE THEORY

Item Response Theory mathematically links item responses to underlying traits. In general, IRT comprises a set of procedures for scaling both stimuli and respondents through the application of nonlinear, probabilistic models that describe the stimulus-respondent interaction in terms of latent variables (McKinley and Mills, 1989). IRT rests on two basic postulates: (a) the performance of a respondent on a scale item can be explained by a factor called 'latent trait' or 'ability' (θ), and (b) the relationship between an individual's observable response to an item and the individual's standing on the unobservable trait measured by the test can be represented by a monotonically increasing function called an 'Item Characteristic Curve' or ICC (Ramsay, 1995b). Specifically, an Item Characteristic Curve depicts response probability as a function of the latent trait measured by the entire test or composite scale. A fundamental notion of IRT is stressed by Osterlind (1983), who states with respect to testing situations that:

The probability of a correct response is independent of the examinee in the group considered. This is not to imply that P [i.e., the probability of a correct response] is independent of θ [i.e., the latent trait] - in fact, as described, P is a function of θ - but merely that the concept that a given examinee's probability of getting an item correct depends on the examinee's overall ability in the construct being measured rather than the degree of difficulty of the test item considered in the ICC [i.e., Item Characteristic Curve]. It is of paramount importance that this theoretical notion of the separation of items from persons so germane to IRT be clearly understood ... (pp. 58-59)

The mathematical form of an Item Characteristic Curve varies with respect to the number of parameters used to describe the function. All IRT models contain one or more parameters describing the item and one or more parameters describing the respondent. An examination of the latter parameters applies more to achievement testing situations. For the purposes of scale development and assessment, we tend to be mostly concerned with the former, or the parameters describing an item.
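
As a concrete example of such a parametric form, the widely used three-parameter logistic model writes the Item Characteristic Curve in terms of an item discrimination parameter a_i, a difficulty parameter b_i, and a lower asymptote c_i. This is offered only as a standard illustration of a parametric ICC, not as the model underlying TestGraf:

    P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-a_i(\theta - b_i)]}

Setting c_i = 0 yields the two-parameter logistic model, and additionally fixing a_i to a common value yields the one-parameter (Rasch) model; the number of parameters per item is precisely what drives the estimation burden discussed earlier.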

The shortcomings of Classical Test Theory (i.e., the dependence of item statistics on the sample of respondents, the dependence of respondent statistics on the set of items, and the unrealistic implications of its basic assumptions) are for all intents and purposes eliminated in Item Response Theory. IRT gets around these problems by presenting item statistics in terms of the relationship between an item and the underlying variable or latent trait being measured. Accordingly, IRT-based approaches do not present item statistics in terms of both the item and the particular sample. This aspect of IRT is referred to as 'parameter invariance.' In fact, it has been verified that for any given level of ability the Item Characteristic Curve will remain the same for every subgroup of respondents (Lord, 1980). This is where most of the power of IRT rests (McKinley and Mills, 1989).

The notion of local independence and the unidimensionality of the latent trait θ are the two main assumptions underlying most IRT models (Osterlind, 1983; Ramsay, 1995b). The first implies that if answering one item influences answers to subsequent items, then the items are not locally independent. The second assumption specifies a unidimensional latent trait θ: it implies that an individual's responses to scale items can be attributed to a single trait or ability. Factor analysis readily reveals whether two or more factors underlie a particular set of items. If so, the test or scale is not suitable for most forms of IRT analysis.

DIFFERENTIAL ITEM FUNCTIONING

We noted earlier that for two versions of a scale to be equivalent, all pairs of items (i.e., source and target versions) must be equivalent. This calls for analyses at the item level rather than at the composite level, as is customary with techniques based on Classical Test Theory. Such a framework is provided by IRT. The notion of Differential Item Functioning (DIF) or item bias flows directly from the notions of the Item Characteristic Curve and of parameter invariance. Specifically, bias is detected for an item when the Item Characteristic Curves for two groups differ once their θ's or latent traits are equated, in other words, once both groups are placed on the same scale (Osterlind, 1983). If different probabilities appear for the two groups, bias or DIF is said to have occurred. The fundamental notion in TestGraf is Pim(θ), the function relating the probability of choosing option m (i.e., '1' to '9') of item i to a level of θ, the latent trait. In the event of differential item functioning or item bias, Pim(θ) will not be constant from group to group. On the other hand, measurement equivalence is achieved if the two item response functions are similar (Candell and Hulin, 1987).
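
In code, a DIF check of this kind reduces to estimating Pim(θ) separately for the reference and focal groups and inspecting the gap between the two curves. The sketch below simulates two groups whose option-response functions differ by construction and reuses the hypothetical option_curve() smoother sketched earlier; the sample sizes, grid, and shift are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    theta_ref = rng.normal(size=223)   # e.g., reference (source-language) respondents
    theta_foc = rng.normal(size=70)    # e.g., focal (target-language) respondents

    # Simulate endorsement of one option; the focal group's curve is shifted,
    # so differential item functioning is present in the data by construction.
    choose_ref = (rng.random(223) < 1.0 / (1.0 + np.exp(-theta_ref))).astype(float)
    choose_foc = (rng.random(70) < 1.0 / (1.0 + np.exp(-(theta_foc - 0.5)))).astype(float)

    grid = np.linspace(-2.5, 2.5, 51)
    p_ref = option_curve(theta_ref, choose_ref, grid)
    p_foc = option_curve(theta_foc, choose_foc, grid)
    max_gap = np.max(np.abs(p_ref - p_foc))   # large gaps signal DIF for this option
    print(round(float(max_gap), 3))

If the two curves traced essentially the same function of θ, the option would be functioning equivalently across groups; systematic gaps like the one built into this simulation indicate DIF.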

METHOD

Description of Measures

The preliminary version of the Italian Culture Change Questionnaire contains two sections. The first is designed to tap participation in the Italian culture while the second assesses participation in the English-Canadian culture. Both sections predominantly contain Likert-type items ranging from '1' (Disagree Strongly) to '9' (Agree Strongly). Nine conceptual dimensions are proposed for each of the two sections. Eighty-one items tap Italian culture maintenance and eighty-four items assess acquisition of English-Canadian culture. Apart from occasional variations in item content, one section essentially mirrors the other.

All English-language source items were translated into Italian by a first translator, which resulted in two versions of the questionnaire. The translations were then checked for accuracy by a second translator. Back-translation per se did not occur in a formal fashion. For the sake of brevity, this paper focuses on analyses of items from one of the eighteen conceptual dimensions, namely 'Social Interaction with Other Italians.' The eight English-language items along with their Italian translations appear in Appendix 1. A complete description of the measure along with its conceptual underpinnings appears in Tomiuk (1993).

Subjects

The questionnaire was administered via mail survey to respondents of Italian origin within the greater metropolitan area of Montreal, Canada. Each was given a choice between the English-language version and the Italian-language translation. The response rate was approximately 29%. A total of 223 usable English-language versions and 70 usable Italian-language versions were collected. Most respondents (55.4%) were female, and a substantial portion (29.6%) fell in the 30-39 age category, with a slightly smaller proportion (25.5%) in the 20-29 age category. When asked about their place of birth, 51.7% reported Canada and 46.9% reported Italy. Most (98%) reported having Canadian citizenship.

Test of the Unidimensionality Assumption

The eight items in each version were submitted to factor analysis separately. Principal components analysis revealed that in each case, the first factor accounted for 51.5% of the total variance. The eigenvalues associated with the first factors generated from both versions equalled 4.12. The second factors had eigenvalues of .97 for the English version and 1.11 for the Italian version, and accounted for only 12.1% and 13.9% of the total variance, respectively. Clearly, a single factor predominated in both versions of the scale. We took this as an indication that the unidimensionality assumption was met.
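
A check of this kind amounts to inspecting the eigenvalues of the item correlation matrix. The sketch below shows a minimal version of the procedure; the function name is ours, and the analysis would be run separately on the English-version and Italian-version response matrices.

    import numpy as np

    def pca_eigenvalues(X):
        # X: respondents-by-items matrix for one language version of the scale
        X = np.asarray(X, dtype=float)
        R = np.corrcoef(X, rowvar=False)          # item correlation matrix (8 x 8 here)
        eigvals = np.linalg.eigvalsh(R)[::-1]     # eigenvalues, largest first
        return eigvals, eigvals / eigvals.sum()   # raw values and variance proportions

A first eigenvalue that dwarfs the remaining ones, as reported above, supports treating the eight items as spanning a single dimension.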

TestGraf Analyses

Analysis of TestGraf results is largely based on the visual inspection of graphical output. Option Characteristic Curves are one type of output used in the assessment of DIF. An Option Characteristic Curve is produced for each option within each item for each group. In the case of an item assessed on a nine-point scale, a total of nine pairs of Option Characteristic Curves will be generated for that item. Figure 1 contains two pairs of Option Characteristic Curves. The upper pair reflects the functioning of Item 1 at response option '1' while the lower pair is associated with Item 8 at response option '8.' In each graph, the curve labelled '1' relates to the English version of the item whereas the curve labelled '2' is associated with the Italian version. Pim(θ) appears on the vertical axes. h(θ), a function of the latent trait, is the entire-scale score that respondents with trait level θ will have, on average (Ramsay, 1995a). Accordingly, it appears as 'expected score' on the horizontal axes. In this case, θ, the latent trait, is 'Social Interaction with Other Italians' and constitutes a dimension of ethnic identification or of ethnic culture maintenance (Phinney, 1990).
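
Assuming the nine option weights '1' to '9' and n items in the scale, one standard way to write the expected score plotted on the horizontal axes is the following; we state it here for clarity rather than quoting TestGraf's documentation:

    h(\theta) = \sum_{i=1}^{n} \sum_{m=1}^{9} m \, P_{im}(\theta)

The inner sum over the options of a single item i gives that item's expected score, which is the quantity plotted on the vertical axis of Figure 2.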

Differential item functioning, if present, manifests itself in the two Option Characteristic Curves failing to coincide, as is the case for both pairs of curves in Figure 1. The most apparent difference between the curves for Item 1 is that the probability of choosing option '1' is greater for the English version of the item at all but the highest levels of the latent construct. The probability of choosing option '8' for Item 8 is also greater at higher levels of θ for the English-language version of the item. This tendency was quite general across the English-version items: the probability of choosing a particular option tended to be higher for English-language items than for their Italian counterparts at higher levels of θ, the latent trait.

TestGraf also provides β (beta) indices which summarize the amount of DIF for each response option to an item. These appear to the right of the plot. The upper index (βR) is applicable only when more than one group is being compared to the reference group (i.e., the English version) and is therefore irrelevant here. The lower index (βF) relates to the focal group (i.e., the Italian version). In general, DIF is present whenever the index departs from zero in either direction. In Figure 1, the βF for Item 1/option '1' and for Item 8/option '8' both equal -0.09, indicating bias between the two language versions of each item. The βF values for the eight items and each of the nine options appear in Table 1. A simple count of the number of negative and positive values across rows indicates that many of the seventy-two pairs of Option Characteristic Curves generated by TestGraf show differential item functioning.
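
To convey what such a summary index captures, the sketch below computes a signed, weighted average gap between the focal and reference option curves on a common grid. It mimics the role of TestGraf's β index but is not its exact formula (see Ramsay, 1995a); weighting by the reference group's score distribution is our assumption.

    import numpy as np

    def signed_dif_index(p_ref, p_foc, weights):
        # p_ref, p_foc : option curves for reference and focal groups on a common grid
        # weights      : e.g., density of reference-group scores at each grid point
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        return float(np.sum(w * (np.asarray(p_foc) - np.asarray(p_ref))))

A value near zero indicates that the focal curve tracks the reference curve closely; values away from zero in either direction summarize the direction and extent of DIF for that option.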

Another type of graph generated by the program appears in Figure 2. Two pairs of curves are presented, for Item 3 and Item 6, respectively. 'Expected item score' is plotted on the vertical axis and 'expected score' appears on the horizontal axis of each graph. These plots summarize the information contained in the nine pairs of Option Characteristic Curves generated per item. DIF appears to be present for both items. The English version of Item 3 (Curve 1) shows higher expected item scores than the Italian version (Curve 2) at various levels of the latent trait. Less DIF is apparent, however, for Item 6, where the curves come closer to coinciding across the range of expected scores. In fact, Item 3 is one of the better items among the eight in the composite scale in terms of displaying less DIF.

In sum, visual inspection of all the plots generated by TestGraf for the eight-item scale, along with the information contained in Table 1, revealed pervasive item bias throughout the scale. In some instances item bias was moderate but still present; in other cases it was very apparent. Most items were associated with curves showing clear differential item functioning, and most options for each item displayed anomalies comparable to those that appear in Figure 1.

DISCUSSION AND CONCLUSION

This article presents a method for assessing the presence of differential item functioning for Likert-type items translated from a source to a target language. The method relies heavily on visual inspection of graphical output produced by TestGraf, an Item Response Theory program. IRT presents many advantages over Classical Test Theory, which underlies traditional scale purification approaches. These approaches still predominate in the area of consumer research (e.g., Churchill, 1979; Zaichkowsky, 1985). Traditional approaches produce and rely on statistics that vary from sample to sample. Moreover, traditional approaches are limited to analyses at the composite level, and their assumptions are unrealistic and often stand violated (Allen and Yen, 1979; Hambleton and Swaminathan, 1985). On the other hand, IRT is geared toward analysis at the item level and its power rests in parameter invariance. IRT therefore presents a wide range of applications in scale purification and in checks of differential item functioning between groups of individuals. Its applications have, however, been limited by the difficulties of parametric estimation methods. The IRT program TestGraf circumvents many of these difficulties and provides a simple and user-friendly approach to IRT analysis. Moreover, it does not require that polychotomous scales be dichotomized prior to IRT analysis.

FIGURE 1. TESTGRAF OUTPUT: PAIRS OF OPTION CHARACTERISTIC CURVES

TABLE 1. βF INDICES FOR PAIRS OF OPTION CHARACTERISTIC CURVES

FIGURE 2. TESTGRAF OUTPUT: EXPECTED ITEM SCORE AS A FUNCTION OF EXPECTED SCORE ON THE ENTIRE SCALE

Typically, many scales are developed for use in one culture and then used to assess respondents from another culture. Clear reliability and validity issues arise with respect to the use of a scale in a culture other than the one for which it was originally developed. Moreover, use of a scale in another culture usually requires translation. It is therefore also necessary to establish whether items function equivalently for both cultural groups (Ellis et al., 1993). IRT allows one to ascertain whether equivalent or differential item functioning is present between two versions of a scale. In the event that high levels of DIF are present between source and translated versions, translation problems remain but one possible source of non-equivalence. Nevertheless, they are the most likely source of difficulty with respect to our scale because the translators we employed did not undertake formal back-translation. They simply reviewed the two sets of items and flagged Italian translations that did not appear equivalent to their English counterparts.

Another potential source of difficulty rests in the estimation of the latent trait θ. TestGraf analyses were based on 'expected score' as an approximation or estimate of the latent trait. This score depends on the items used to form the entire scale, and some of these items may have been inappropriate for the assessment of Italian ethnic identification. Such potentially unfit items were then simultaneously being assessed for translation bias (Hulin, 1987). Nevertheless, the content validity of the English-language items was formally assessed by expert judges, most of whom were well published in the field of cross-cultural adaptation (see Tomiuk, 1993).

In sum, the findings suggest a reevaluation of the Italian version of the questionnaire. A formal back-translation process should be undertaken until equivalent item forms are generated. New data would then have to be gathered and submitted to TestGraf analysis.

APPENDIX 1

MEASUREMENT ITEMS

REFERENCES

Allen, M. J. and W. M. Yen (1979), Introduction to Measurement Theory (Monterey, California: Brooks/Cole Publishing Company).

Bontempo, R. (1993), "Translation Fidelity of Psychological Scales: An Item Response Theory Analysis of An Individualism-Collectivism Scale," Journal of Cross-Cultural Psychology, 24(2), 149- 166.

Candell, G. L. and C. L. Hulin (1987), "Cross-Language Translations and Cross-Cultural Comparisons in Scale Translations," Journal of Cross-Cultural Psychology, 17(4), 417-440.

Churchill, G. A., Jr. (1979), "A Paradigm for Developing Better Measures of Marketing Constructs," Journal of Marketing Research, 16, 64-73.

Ellis, B. B., Becker, P. and H. D. Kimmel (1993), "An Item Response Theory Evaluation of an English Version of the Trier Personality Inventory," Journal of Cross-Cultural Psychology, 24(2), 133-148.

Hambleton, R. K., Swaminathan, H. and H. J. Rogers (1991), Fundamentals of Item Response Theory (Newbury Park, CA: Sage Publications, Inc.).

Hambleton, R. K. and H. Swaminathan (1985), Item Response Theory (Boston, MA: Kluwer-Nijhoff Publishing).

Hulin, C. L., Drasgow F. and F. J. Smith (1983), Item Response Theory: Applications to Psychological Measurement (Homewood, IL: Irwin).

Hulin, C. L. (1987), "A Psychometric Theory of Evaluations of Items and Scale Translations: Fidelity Across Languages," Journal of Cross-Cultural Psychology, 18(2), 115-142.

Lord, F. M. (1980), Applications of Item Response Theory to Practical Testing Problems (Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers).

Malpass, R. S. (1977), "Theory and Method in Cross-Cultural Psychology," American Psychologist, 32, 1069-1079.

McKinley, R. L. and C. N. Mills (1989), "Item Response Theory: Advances in Achievement and Attitude Measurement," Advances in Social Science Methodology, 1, 71-135.

Osterlind, S. J. (1983), Test Item Bias (Beverly Hills, CA: Sage Publications).

Phinney, J. S. (1990), "Ethnic Identity in Adolescents and Adults: Review of Research," Psychological Bulletin, 108(3), 499-514.

Ramsay, J. O. (1995a), TestGraf: A Program for the Graphical Analysis of Multiple Choice Test and Questionnaire Data (Montreal, Canada: McGill University).

Ramsay, J. O. (1995b), Some Notes on the Statistical Analysis of Tests (Montreal, Canada: McGill University).

Tomiuk, M. A. (1993), The Development and Content Validation of a Preliminary Multidimensional and Multicultural Measure of Culture Change for Italian-Canadians, Unpublished M. Sc. Thesis, Faculty of Commerce and Administration, Concordia University, Montreal, Canada.

Zaichkowsky, J. L. (1985), "Measuring the Involvement Construct," Journal of Consumer Research, 12, 341-352.
