Assessing Inter-Judge Reliability: a Probabilistic Latent Class Approach

William R. Dillon, University of Massachusetts
Thomas J. Madden, University of Massachusetts
ABSTRACT - The purpose of this paper is to present and illustrate a probabilistic model for assessing inter-judge reliability. The proposed probabilistic model allows one to i) use formal test statistics to evaluate the extent and character of inter-judge reliability, ii) estimate the assignment error rates, and iii) test for simultaneous agreement for more than two judges. The probabilistic model is operationalized in terms of restricted latent class models.

Advances in Consumer Research Volume 11, 1984      Pages 46-51




Increasingly, consumer behavior researchers are soliciting cognitive responses in addition to standard attitudinal measures when attempting to assess the effects of persuasive communications (Lutz and Swasy 1977; MacKenzie and Lutz 1982; Madden 1982; Olson, Toy, and Dover 1978; Wright 1973, 1974, 1975, 1980). Cognitive responses are typically collected by simply asking respondents to list the thoughts they had during the presentation of a particular persuasive message. The coding of the elicited cognitive responses generally involves some sort of categorization, typically undertaken by independent judges. For example, each independent judge might be asked to categorize the elicited thoughts as reflecting either counterarguing, support-arguing, or source derogation (Wright 1980), or to assign each elicited thought to either a positive, neutral, or negative category (Madden 1982). In either event, the quality of the data is, to a large degree, evaluated in terms of some reliability coefficient which reflects the extent to which the independent judges agreed. Since the veracity of the relationships uncovered with the use of the coded cognitive responses is predicated on the reliability and consistency of the judges' ratings, it is important to utilize procedures that can provide evidence on both the extent and character of the inter-judge codings.

Beginning in 1960, new approaches began to emerge for assessing inter-judge reliability: Cohen (1960; 1968) presented kappa and weighted kappa as chance-corrected measures of agreement between two raters, each of whom independently classifies each of a sample of subjects into one of k mutually exclusive and exhaustive categories; Light (1971) extended Cohen's kappa to measure conditional and joint agreement and, in addition, presented a statistic for evaluating the pattern of agreement between two judges; Fleiss (1971) generalized Cohen's unweighted kappa to the case where each of a sample of subjects is rated on a nominal scale by the same number of raters, but where the raters rating one subject are not necessarily the same as those rating another; and Landis and Koch (1977) derived similar statistics for the case of possibly varying numbers of ratings per subject by applying a one-way analysis of variance model. More recently, Mitchell (1979) discusses the inter-observer agreement percentage, the reliability coefficient, and the generalizability coefficient, all of which have been used to reflect the quality of data collected in observational studies; and Kaye (1980) argues for the use of false-alarm and missed-event probabilities as indices of reliability. Finally, in the context of the broader issue of assessing the veracity and general psychometric quality of rating data, Saal, Downey and Lahey (1980) recommend the use of a Rater x Ratee MANOVA.

The purpose of this paper is to present and illustrate a probabilistic model operationalized in terms of latent structure models which can prove useful in assessing inter-judge reliability. The probabilistic models and corresponding latent class structures that are discussed provide the researcher with a more flexible, general, and comprehensive approach to assessing inter-judge reliability than previous approaches. Compared to the other available approaches, which primarily provide information only on the extent of agreement (e.g., Cohen's unweighted and weighted kappa), the latent class modeling approach described herein can prove informative with respect to both the extent and character of inter-judge reliability. Specifically, the proposed probabilistic latent class modeling approach has the following attractive features.

i) It allows one to use formal test statistics to evaluate both the extent and character of inter-judge agreement.

ii) It allows one to estimate the assignment category error rates.

iii) It allows one to test for simultaneous agreement for more than two judges, which is not possible with use of Cohen's kappa or its many variants.

While researchers using Cohen's kappa can compute means and standard errors which could then be used to make inferential statements, the procedure described herein provides a holistic modeling approach to the investigation of inter-judge reliability.


Suppose that J judges have been instructed to assign each of the cognitive responses elicited from a sample of respondents into one of C categories or classes. The cross-classification of the judges' ratings will produce a J-dimensional table having C^J cells. Each cell in the C^J table will be denoted by xk, k=1,2,...,C^J, where a particular xk represents a response pattern indicating the ratings of each judge. For example, with J=3 judges and C=3 assignment categories the 3-dimensional table consists of 27 cells; in this case, the response vector, say, x1 = (1,1,1) indicates that all three judges placed the elicited thought under consideration into the first category, and the corresponding cell count appearing in the table gives the total number of times all three judges agreed in their evaluations. Similarly, the response pattern x2 = (1,1,2) indicates that the first two judges placed the given thought in the first category, while the third judge placed it in the second category, and the corresponding cell count appearing in the table gives the total number of times the first two judges placed thoughts in the first category, but the third judge's assignment was in the second category. Following this reasoning, the 27 cells in the 3-dimensional table can be divided into those that reflect perfect agreement, namely, x1 = (1,1,1), x14 = (2,2,2) and x27 = (3,3,3), and those remaining cells that either reflect partial agreement, e.g., x2 = (1,1,2), or no agreement at all, e.g., x6 = (1,2,3).
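As a quick illustration of this bookkeeping, the following sketch (hypothetical, not from the paper) enumerates the C^J = 27 response patterns for J=3 judges and C=3 categories and classifies each by its level of agreement:

```python
from itertools import product

# Enumerate the C^J cells of the cross-classification table for
# J = 3 judges and C = 3 assignment categories.
J, C = 3, 3
patterns = list(product(range(1, C + 1), repeat=J))  # 27 response patterns

def agreement(x):
    distinct = len(set(x))
    if distinct == 1:
        return "perfect"   # e.g., (1,1,1), (2,2,2), (3,3,3)
    if distinct == 2:
        return "partial"   # e.g., (1,1,2)
    return "none"          # e.g., (1,2,3)

counts = {}
for x in patterns:
    counts[agreement(x)] = counts.get(agreement(x), 0) + 1
print(len(patterns), counts)  # 27 {'perfect': 3, 'partial': 18, 'none': 6}
```

The three "perfect" cells are exactly the true agreement classes the model conditions on below.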

The perfect or true agreement classes will be denoted by Vt, t = 1, 2, ..., T, and let θt, where Σ(t=1 to T) θt = 1, be the probability of observing any one of the T perfect agreement classes.

The probability of observing any particular response vector xk can be written as

P(xk) = Σ(t=1 to T) θt P(xk | Vt)     (1)

and

P(xk) = Σ(t=1 to T) θt Π(j=1 to J) (1-ej)^(ytkj) ej^(1-ytkj).     (2)

In equation 1 the total probability for a given observed pattern of judges' ratings is the weighted sum of the conditional probabilities of observing the response vector given each perfect agreement class, using the weights θt. Equation 2 expresses the total probability in terms of error probabilities for each judge. That is, each judge is presumed to have a unique error parameter ej, where ytkj = 1 if judge j's rating in xk agrees with the corresponding element of Vt, and ytkj = 0 otherwise. For example, as before, assume J=3 judges and C=3 assignment categories, and consider response vector x2 = (1,1,2). With respect to the first true agreement class, V1 = (1,1,1), the third judge has erred, so that

θ1 (1-e1)^1 e1^0 (1-e2)^1 e2^0 (1-e3)^0 e3^1  =  θ1 (1-e1)(1-e2) e3;     (3)

with respect to the second true agreement class, V2 = (2,2,2), the first two judges have erred, so that

θ2 (1-e1)^0 e1^1 (1-e2)^0 e2^1 (1-e3)^1 e3^0  =  θ2 e1 e2 (1-e3);     (4)

with respect to the third true agreement class, V3 = (3,3,3), all three of the judges are in error, so that

θ3 (1-e1)^0 e1^1 (1-e2)^0 e2^1 (1-e3)^0 e3^1  =  θ3 e1 e2 e3.     (5)

Thus, we can write

P(x2 = (1,1,2)) = θ1 (1-e1)(1-e2) e3 + θ2 e1 e2 (1-e3) + θ3 e1 e2 e3.    (6)
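Equation 6 is easy to evaluate numerically. The sketch below uses made-up values for θt and ej (they are illustrative assumptions, not estimates from this paper) to compute P(x2):

```python
# Hypothetical illustration of equation 6: P(x2 = (1,1,2)) is a mixture
# over the three true agreement classes. All parameter values are made up.
theta = [0.5, 0.3, 0.2]   # P(true class V1), P(V2), P(V3)
e = [0.10, 0.15, 0.20]    # error rates for judges J1, J2, J3

p_x2 = (theta[0] * (1 - e[0]) * (1 - e[1]) * e[2]   # V1: judge 3 errs
        + theta[1] * e[0] * e[1] * (1 - e[2])       # V2: judges 1 and 2 err
        + theta[2] * e[0] * e[1] * e[2])            # V3: all three err
print(round(p_x2, 4))  # 0.0807
```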

The probabilistic model shown in equation 2 simplifies if we assume that each judge has the same error structure; that is, if we assume that errors occur at a common rate, say, γ, then

P(xk) = Σ(t=1 to T) θt γ^(Ykt) (1-γ)^(J-Ykt),     (7)

where Ykt is the number of errors in the observed pattern of the judges' ratings, xk, given that Vt is the true assignment class. Other variations on the probabilistic model shown in equation 1 are possible. How these are specified should become clear in the discussion of latent class models which follows.
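The judge-specific model of equation 2 and the uniform-rate special case of equation 7 can be sketched together; the θt and γ values below are illustrative assumptions only:

```python
# Sketch of equations 2 and 7 with made-up parameter values.
# Each judge contributes (1 - e_j) on a match with the true class
# and e_j on a mismatch, as in equation 2.
def p_pattern(x, classes, theta, e):
    total = 0.0
    for t, v in enumerate(classes):
        term = theta[t]
        for j, (rating, truth) in enumerate(zip(x, v)):
            term *= (1 - e[j]) if rating == truth else e[j]
        total += term
    return total

classes = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]   # true agreement classes
theta = [0.5, 0.3, 0.2]

# Equation 7: a uniform error rate gamma is the special case e1 = e2 = e3.
gamma = 0.1
p = p_pattern((1, 1, 2), classes, theta, [gamma] * 3)
# Equivalently, sum over t of theta_t * gamma^Ykt * (1-gamma)^(J-Ykt):
p_check = (0.5 * gamma * (1 - gamma) ** 2     # Ykt = 1 error under V1
           + 0.3 * gamma ** 2 * (1 - gamma)   # Ykt = 2 errors under V2
           + 0.2 * gamma ** 3)                # Ykt = 3 errors under V3
assert abs(p - p_check) < 1e-12
print(round(p, 4))  # 0.0434
```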


The probabilistic model shown in equation 1 is actually a latent class model. Maximum likelihood estimates of the error parameters shown in equations 2 and 7 can be obtained by means of the MLLSA program developed by Clogg (1977). Equations 2 and 7 correspond to restricted latent class models. In this section we first discuss some general aspects of latent class analysis and then relate the latent class model to the probabilistic model developed in the previous section.

Conceptualization and Notation

Suppose that A and B are two categorical variables that exhibit association. Now consider adding a new categorical variable, X, to the system so that A and B are unrelated at each level of X. We would say that X explains, in some sense, the relationship between A and B. Latent structure analysis is an attempt to find variables such as X, where X is an unobserved (latent) instead of an observed (manifest) variable. An analogous situation is that of factor analysis, in which latent continuous variables, the factors, explain the relationships among the observed continuous variables. In general, we will consider only one latent variable, X, with (possibly) many categories. However, by placing certain restrictions on the parameters of the models developed, we can make X behave as if it represents more than one latent variable, and thus can construct more complex models (Goodman 1974; Clogg 1981; Madden and Dillon 1982).

Parameters of the latent class model are of two kinds: unconditional probabilities of being in each latent class, and conditional probabilities of making a particular response to an item, given membership in a certain latent class. The notation of Goodman (1974) will be used for the general case. The unconditional probabilities will be denoted by π^X_1, π^X_2, ..., where π^X_i is the probability that an individual is in class i of the latent variable X. The conditional probabilities will be written as π^(AX)_(ij), meaning the probability that a person in latent class j will give response i to item A. From these parameters, the probability of observing any particular response pattern for items A, B, C, etc. can be computed with only the additional assumption that the responses to items are independent for people within each latent class. Though conditional independence is a restrictive assumption, this is what we actually mean by saying that the latent variable X explains the observed relationships in the data. In factor analysis terminology, this corresponds to the assumption that the factors completely explain the observed covariances.

Goodman (1974) illustrated how the parameters of the model can be estimated using maximum likelihood methods. The MLLSA program implements these methods. The program allows both unrestricted and restricted models to be fit, where the restrictions can either be to fix certain parameters at a given value, or to constrain parameters to equal each other.
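For readers without access to MLLSA, the estimation idea can be sketched with a small EM routine for the judge-specific error model of equation 2. This is an illustration under simplifying assumptions (an erring judge is taken to be equally likely to choose either wrong category, a detail the equations leave implicit), not a reimplementation of Clogg's program; all data below are simulated:

```python
from collections import Counter
import random

# Minimal EM sketch for the judge-specific error model (equation 2),
# treated as a restricted latent class model with the true agreement
# classes as latent classes. Illustrative only.
random.seed(7)
J, C = 3, 3
classes = [(c,) * J for c in range(1, C + 1)]       # (1,1,1), (2,2,2), (3,3,3)
true_theta, true_e = [0.5, 0.3, 0.2], [0.10, 0.20, 0.15]

# Simulate N rated thoughts from the model.
N = 5000
data = []
for _ in range(N):
    t = random.choices(range(C), weights=true_theta)[0]
    x = tuple(
        random.choice([c for c in range(1, C + 1) if c != t + 1])
        if random.random() < true_e[j] else t + 1
        for j in range(J)
    )
    data.append(x)
counts = Counter(data)                              # at most C**J = 27 cells

theta, e = [1.0 / C] * C, [0.25] * J
for _ in range(300):
    sum_theta, sum_err = [0.0] * C, [0.0] * J
    for x, n in counts.items():
        # E-step: posterior weight of each true class for this pattern.
        w = []
        for t, v in enumerate(classes):
            term = theta[t]
            for j in range(J):
                term *= (1 - e[j]) if x[j] == v[j] else e[j] / (C - 1)
            w.append(term)
        z = sum(w)
        for t in range(C):
            sum_theta[t] += n * w[t] / z
            for j in range(J):
                if x[j] != classes[t][j]:
                    sum_err[j] += n * w[t] / z
    # M-step: update class proportions and judge error rates.
    theta = [s / N for s in sum_theta]
    e = [s / N for s in sum_err]

print([round(v, 3) for v in theta], [round(v, 3) for v in e])
```

With a sample this size the estimates land close to the generating values, which is the consistency property the maximum likelihood machinery provides.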

Latent Class Models as a Framework

The models presented in this paper all result from specifying constraints on the latent class conditional probabilities. The error parameters in equations 2 and 7 can be obtained by fitting restricted latent class models having appropriate equality constraints. These constraints are equivalent to dividing the conditional probabilities for judge j into sets Sj1, Sj2, ..., such that the conditional probabilities within each set are constrained to be equal. In general, if there are C assignment categories, then there will be at least C sets for each judge; conceptually they represent the true agreement classes. With J=3 judges and C=3 assignment categories, three parameters aj, bj, and cj are sufficient to specify the conditional probabilities for each judge. If aj1 is the probability that the jth judge places the elicited thought into the first assignment category, given that the true agreement class is V1 = (1,1,1), then bj1 and cj1 represent errors (i.e., probabilities of disagreement), and the weighted error rate is θ1(1-a11) + θ1(1-a21) + θ1(1-a31). For certain models, there can be constraints across the true agreement classes so that both subscripts on the ajt's, bjt's, and cjt's are not necessary. For other models various ajt, bjt, and cjt might be fixed at one, or set to zero. As we will demonstrate, by considering alternative constraints such as these, many variations of a particular model can be generated.

A Latent Class Translation

Translating a particular probabilistic model into a latent class model is quite straightforward. To illustrate, assume once again J=3 judges and C=3 assignment categories. One plausible hypothesis is that the judges have the same error rates. This hypothesis corresponds to the probabilistic model presented in equation 7, which postulates a uniform error rate. Letting J1, J2, and J3 denote the three judges, this model can be estimated by imposing the following equality constraints:


To see why this restricted model yields a uniform error rate, merely note that the first true agreement class implies that the only permissible (i.e., correct) assignment is to the first category; thus an assignment to either the second or the third category constitutes an error which occurs with probabilities EQUATION.

Analogous results hold for the other true agreement classes, so that each judge's unreliability (i.e., error rate) is also at the level a.

The probabilistic model of equation 2 corresponds to a restricted three-class latent structure with the following equality constraints:


Note that each judge can have a different error rate.

Some Other Special Latent Class Models

Other special cases of latent class inter-judge reliability models can be developed from premises that have been used in Guttman scaling and linear learning hierarchy models (Dayton and Macready 1976, 1980; Macready and Dayton 1977, 1980; Proctor 1970). A strict Guttman scale does not permit response errors; i.e.,

a11 = a21 = a31 = 1

b12 = b22 = b32 = 1

c13 = c23 = c33 = 1

or, in terms of the latent class conditional probabilities,


Conceptually, such a model would mean that the judges are in perfect agreement, but obviously such a model is seldom correct. A variation of the Guttman model has been suggested by Goodman (1975). This form of the latent class model is called a quasi-independence model. This model is formed by adding an extra latent class on which there are no restrictions while retaining the Guttman restrictions shown above. The classes with the restrictions are what Goodman calls "perfect scale types," while those in the unrestricted class are his "non-scale types." A quasi-independence model is considered in the example discussed below.

Yet another latent class inter-judge reliability model can be developed from the linear learning hierarchy model. If the assignment categories possess some ordering, say, positive, neutral, or negative, then a linear hierarchy exists, and the latent classes can be made to behave according to the presumed rank ordering. For example, suppose that the judges are allowed to err in favor of the neutral assignment category; that is, it is acceptable for one judge to assign the given thought to the positive (negative) category and another judge to assign it to the neutral category. Thus for the tth latent class, only responses to the same or immediately adjacent categories (t-1 or t+1) are permissible. This type of latent class model, which recognizes special kinds of disagreements, is analogous to weighted kappa (Cohen 1968).
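Under such an ordering, the permissible assignments for true class t can be sketched directly; the category coding below (1 = positive, 2 = neutral, 3 = negative) follows the example in the text:

```python
from itertools import product

# Hypothetical sketch of the linear-hierarchy restriction: under true
# class t, a judge's assignment is permissible only if it falls in
# {t-1, t, t+1} on the ordered scale 1 = positive, 2 = neutral, 3 = negative.
C, J = 3, 3

def permissible(x, t):
    return all(abs(rating - t) <= 1 for rating in x)

# Under the positive true class (t = 1), any pattern containing a 3
# is impermissible, leaving only patterns drawn from {1, 2}.
ok_t1 = [x for x in product(range(1, C + 1), repeat=J) if permissible(x, 1)]
print(len(ok_t1))  # 8

# Under the neutral true class (t = 2), every pattern is permissible.
print(all(permissible(x, 2) for x in product(range(1, C + 1), repeat=J)))  # True
```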


Several models of the type discussed earlier were fit to the data shown in Table 1. The data were collected in a persuasive communication study (Madden 1982) in which three judges, denoted hereafter by J1, J2, and J3, evaluated the cognitive responses of 164 respondents. On the basis of the elicited cognitive responses, the judges were instructed to categorize each respondent as being either positive, neutral or negative toward the manipulated object. Table 2 contains the results of fitting the various models.



Models M1 through M7 represent seven different combinations of constrained and unconstrained conditional probabilities. Model M1 is a three-class latent structure which places no restrictions on the latent conditional probabilities. Note from Table 2 that Model M1 provides a very good fit. Table 3 presents the latent class parameter estimates for the model. The character of each latent class is well defined: latent class 1 is primarily a positive agreement class; latent class 2 is primarily a neutral agreement class; and latent class 3 is primarily a negative agreement class. By applying the estimated latent class parameters the latent reliability of the three judges can be easily assessed. For example, the estimated overall error rate associated with Judge J1 is a weighted average of his individual errors, that is,

[.443(0.1543+0.0135)] + [.331(0.0850+0.0741)] + [.226(0.0260+0.0562)] = 0.1456.

Similarly, for Judge J2 we find

[.443(0.0125+0.0000)] + [.331(0.3268+0.1004)] + [.226(0.0687+0.0258)] = 0.1683;

and for Judge J3

[.443(0.0658+0.0000)] + [.331(0.1137+0.0882)] + [.226(0.0000+0.2115)] = 0.1438.

Thus, although all three judges appear to have acceptable error rates, Judges J1 and J3 are more reliable than Judge J2. Note, however, that Judge J2 does extremely well with respect to positive and negative assignments.
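The weighted error rates quoted above can be reproduced directly from the Table 3 conditional-probability estimates (class weights .443, .331, .226):

```python
# Reproducing the judges' overall error rates from the estimates quoted
# in the text: each rate is a class-weighted sum of the two per-class
# error probabilities for that judge.
weights = [0.443, 0.331, 0.226]
errors = {                      # per-class error probabilities per judge
    "J1": [(0.1543, 0.0135), (0.0850, 0.0741), (0.0260, 0.0562)],
    "J2": [(0.0125, 0.0000), (0.3268, 0.1004), (0.0687, 0.0258)],
    "J3": [(0.0658, 0.0000), (0.1137, 0.0882), (0.0000, 0.2115)],
}
rates = {
    judge: round(sum(w * (a + b) for w, (a, b) in zip(weights, errs)), 4)
    for judge, errs in errors.items()
}
print(rates)  # {'J1': 0.1456, 'J2': 0.1683, 'J3': 0.1438}
```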



The overall reliability of each assignment category is also important from the standpoint of assessing the quality of the data and specifically in investigating the pattern of errors. The reliability of each assignment category can be estimated by applying the latent class parameter estimates shown in Table 3. For example, in the first perfect agreement class, the estimated total probability of a neutral error is

.443[0.1543+0.0125+0.0658] = 0.1029;

whereas the estimated total probability of a negative error is, as expected, much lower, namely

.443[0.0135+0.0000+0.0000] = 0.0059.

Table 4 gives the estimated error rates for the assignment categories given each of the perfect agreement classes. From the table it is clear that the neutral assignment category is the least reliable. The other estimated error rates and reliabilities are quite encouraging.

Models M2 and M3 are restricted three-class latent structures. M2 fits the uniform error model (i.e., equation 7), and M3 fits the judge-specific error model (i.e., equation 2). Both models have been previously described in some detail. Neither of the models fits the data to an acceptable degree, at least from a statistical point of view.



The next two models to be fit, M4 and M5, are very similar to Models M2 and M3. Model M2 is a special case of Model M4 and Model M3 is a special case of Model M5; in both models, the EQUATION parameter was freed. The goodness-of-fit of successive (i.e., nested) models can be compared by taking the difference in the likelihood-ratio chi-squares for the models, assuming that one of the models being compared fits the data to an acceptable degree. The difference is itself distributed as chi-square with degrees of freedom equal to the difference in degrees of freedom between the models. M4 can be compared to M2 and M5 can be compared to M3. The test of M2 versus M4 gives a chi-square of 41.64-24.92 = 16.72 with 1 degree of freedom, and the test of M3 versus M5 gives 36.60-17.76 = 18.84 with 1 degree of freedom. Both of these values are significant at the 0.001 level. Thus, in rejecting Model M2 we are rejecting the constraint that the probability of an error is the same for all judges and assignment categories. In rejecting Model M3 we are rejecting the constraint that, for Judge J2, the probability of each type of error is the same.
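The likelihood-ratio difference tests above are easy to verify; for 1 degree of freedom the chi-square upper-tail probability equals erfc(sqrt(x/2)):

```python
from math import erfc, sqrt

# Checking the nested-model comparisons quoted in the text.
def chi2_sf_1df(x):
    # Upper-tail probability of a chi-square variate with 1 df.
    return erfc(sqrt(x / 2.0))

m2_vs_m4 = 41.64 - 24.92   # = 16.72, 1 df
m3_vs_m5 = 36.60 - 17.76   # = 18.84, 1 df
print(round(m2_vs_m4, 2), round(m3_vs_m5, 2))          # 16.72 18.84
print(chi2_sf_1df(m2_vs_m4) < 0.001,
      chi2_sf_1df(m3_vs_m5) < 0.001)                   # True True
```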



A direct comparison of Model M4 versus Model M5 is not possible since one model is not a nested version of the other. In such cases the choice between them is a matter of personal preference, and must be made on the basis of theory or other kinds of evidence. M5 fits the data slightly better than M4, but M4 is more parsimonious. Table 5 presents the latent class parameter estimates under each model. Under Model M3 Judges J1 and J3 make errors at the rate of 0.121, whereas for J2 the error rate is considerably higher, namely

[.468(0.1209+0.0001)] + [.3065(0.3316+0.1135)] + [.225(0.0773+0.0437)] = 0.220,

which can be traced to the judge's faulty assignments in the neutral category. Under Model M5 each judge has a different probability of making an error, which is, with the exception of Judge J2, constant across assignment categories. Direct calculations show the following error rates:

1-0.8464 = 0.1536, for Judge J1,

1-0.8539 = 0.1461, for Judge J3,


[.4638(0.0315+0.0112)] + [.3152(0.3121+0.1045)] + [.2211(0.0427+0.0000)] = 0.1606, for Judge J2.

Once again Judge J3 is the most reliable, whereas Judge J2 is least reliable.

The final two models considered were Guttman scale-type models. Model M6 is a strict Guttman scale model in which no errors are permissible. As expected, this model did not provide an adequate fit to the data. Model M7 is a quasi-independence model which differs from M6 only in that an extra latent class on which there are no restrictions has been added. Though M7 is an improvement over M6, it does not provide an adequate fit, at least from a statistical viewpoint.




Clogg, C.C. (1977), "Unrestricted and Restricted Maximum Likelihood Latent Structure Analysis: A Manual for Users," Working Paper 1977-09, Population Issues Research Center, Pennsylvania State University, University Park, Penn.

Clogg, C.C. (1981), "New Developments in Latent Structure Analysis," in Factor Analysis and Measurement in Sociological Research, D.M. Jackson and E.F. Borgatta, eds., Beverly Hills, CA.: Sage.

Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37-46.

Cohen, J. (1968), "Weighted Kappa: Nominal Scale Agreement with Provision for Scaled Disagreement or Partial Credit," Psychological Bulletin, 70, 213-220.

Dayton, C.M. and Macready, G.B. (1976), "A Probabilistic Model for Validation of Behavioral Hierarchies," Psychometrika, 41, 189-204.

Dayton, C.M. and Macready, G.B. (1980), "A Scaling Model with Response Errors and Intrinsically Unscalable Respondents," Psychometrika, 45, 343-356.

Fleiss, J.L. (1971), "Measuring Nominal Scale Agreement Among Many Raters," Psychological Bulletin, 76, 378-382.

Goodman, L.A. (1974), "The Analysis of Qualitative Variables When Some of the Variables are Unobservable, Part I: A Modified Latent Structure Approach," American Journal of Sociology, 79, 1179-1259.

Goodman, L.A. (1975), "A New Model for Scaling Response Patterns: An Application of the Quasi-Independence Concept," Journal of The American Statistical Association, 70, 755-768.

Kaye, E. (1980), "Estimating False Alarms and Missed Events From Interobserver Agreement: A Rationale," Psychological Bulletin, 88, 458-468.

Landis, J.R. and Koch, G.G. (1977), "A One-Way Components of Variance Model for Categorical Data," Biometrics, 33, 671-679.

Light, R.J. (1971), "Measures of Response Agreement for Qualitative Data: Some Generalizations and Alternatives," Psychological Bulletin, Vol. 76, No. 5, 365-377.

Lutz, R. and Swasy, J. (1977), "Integrating Cognitive Structure and Cognitive Response Approaches to Measuring Communication Effects," Advances in Consumer Research, ed. William Perreault, Atlanta: Association for Consumer Research, Vol. 4.

MacKenzie, S.B. and Lutz, R.J. (1982), "Monitoring Advertising Effectiveness: A Structural Equation Analysis of the Mediating Role of Attitude Toward the Ad," Working Paper Series, Center for Marketing Studies, Univ. of California, Los Angeles, (January) No. 117.

Macready, G.B. and Dayton, C.M. (1977), "The Use of Probabilistic Models in the Assessment of Mastery," Journal of Educational Statistics, Vol. 2, No. 2, Summer, 99-120.

Macready, G.B. and Dayton, C.M. (1980), "A Two-Stage Conditional Estimation Procedure for Unrestricted Latent Class Models," Journal of Educational Statistics, Vol. 5, No. 2, Summer, 129-156.

Madden, T.J. (1982), "Latent Structure Analyses of Message Reactions," Advertising and Consumer Psychology, eds. Larry Percy and Arch Woodside, Lexington: Lexington Books, 303-313.

Madden, T.J. and Dillon, W.R. (1982), "Causal Analysis and Latent Class Models: An Application to a Communication Hierarchy of Effects Model," Journal of Marketing Research, Vol. 19, 472-90.

Mitchell, S.K. (1979), "Interobserver Agreement, Reliability, and Generalizability of Data Collected in Observational Studies," Psychological Bulletin, Vol. 86, No. 2, 376-390.

Olson, J.C., Toy, D.R., and Dover, P.A. (1978), "Mediating Effects of Cognitive Responses to Advertising on Cognitive Structure," in Advances in Consumer Research, ed. H.K. Hunt, Ann Arbor, MI: Association for Consumer Research, Vol. 5.

Proctor, C.H. (1970), "A Probabilistic Formulation and Statistical Analysis of Guttman Scaling," Psychometrika, 35, 73-78.

Saal, F.E., Downey, R.G., and Lahey, M.A. (1980), "Rating the Ratings: Assessing the Psychometric Quality of Rating Data," Psychological Bulletin, Vol. 88, No. 2, 413-428.

Wright, P.L. (1973), "The Cognitive Processes Mediating Acceptance of Advertising," Journal of Marketing Research, Vol. 10, 53-62.

Wright, P.L. (1974), "On the Direct Monitoring of Cognitive Responses to Advertising," Buyer/Consumer Information Processes, eds. G.D. Hughes and M.L. Ray, Chapel Hill, NC: Univ. of North Carolina Press.

Wright, P.L. (1975), "Factors Affecting Cognitive Resistance to Advertising," Journal of Consumer Research, Vol. 2, 60-7.

Wright, P.L. (1980), "Cognitive Responses to Mass Media Advocacy," Cognitive Responses in Persuasion, eds. R. Petty, T. Ostrom and T. Brock. Hillsdale, NJ: Erlbaum.