Beyond Kappa: A Review of Interrater Agreement Measures

The above results show the effectiveness of IA in the considered cases. They are analogous to those produced by κ; however, while the latter lacks a clear operational interpretation, the former has a formal yet intuitive meaning: it measures the normalized amount of information exchanged between the two raters through the agreement channel. Agreement measures are useful tools both for comparing different assessments of the same diagnostic outcomes and for validating new rating systems or devices. Cohen's kappa (κ) is by far the most popular agreement measure between two raters and has proven its usefulness over the past sixty years. Nevertheless, it suffers from some well-known issues that have been pointed out since the 1970s; moreover, its value strongly depends on the prevalence of the condition in the sample under consideration. This work introduces a new agreement index, Information Agreement (IA), which avoids some of the shortcomings of Cohen's kappa and separates the contribution of prevalence from the core of the agreement. These goals are achieved by modelling agreement – in both the dichotomous and the multivalued ordered-categorical case – as the information exchanged between the two raters through the virtual diagnostic channel that connects them: the more information exchanged between the raters, the higher their agreement. To test its behaviour and effectiveness, IA was evaluated on some cases known to be problematic for κ, in a machine learning context, and in a clinical scenario comparing ultrasound (US) and the automated breast volume scanner (ABVS) in breast cancer imaging. Unlike κ, IA correctly measures the stochastic distance between P_XY and P_X·P_Y, i.e., the distance of the two raters from the independence condition; this is done by taking into account both the agreement and the disagreement components of the raters' joint probability distribution.
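Since, as described later in this text, IA amounts to the mutual information MI(X,Y) normalized by min{H(X),H(Y)}, a minimal Python sketch may help fix the idea; the function name and the table values are illustrative and are not taken from the authors' software:

    import numpy as np

    def information_agreement(counts):
        # IA sketch: mutual information between the raters X and Y,
        # normalized by min{H(X), H(Y)}; 0*log(0) is taken as 0.
        p_xy = counts / counts.sum()            # joint distribution P_XY
        p_x = p_xy.sum(axis=1)                  # marginal distribution of rater X
        p_y = p_xy.sum(axis=0)                  # marginal distribution of rater Y

        def h(p):                               # Shannon entropy of a distribution
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        mi = h(p_x) + h(p_y) - h(p_xy.ravel())  # MI(X,Y): distance of P_XY from P_X*P_Y
        return mi / min(h(p_x), h(p_y))

    # Hypothetical 2x2 table of joint rating counts (rows: rater X, columns: rater Y)
    table = np.array([[40, 5],
                      [10, 45]])
    print(information_agreement(table))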

In addition, it has a precise meaning from an information-theoretic point of view, since it represents the (normalized) amount of information exchanged between the two raters. In this sense, IA is a natural complement to Cohen's κ for measuring agreement. Kappa statistics are used to assess the agreement between two or more raters when the measurement scale is categorical. In this brief summary, we discuss and interpret the main features of the kappa statistic, the influence of prevalence on its value, and its usefulness in clinical research. We also introduce the weighted kappa for ordinal outcomes, and the intraclass correlation coefficient for assessing agreement when the data are measured on a continuous scale.
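As a concrete illustration of the weighted kappa mentioned above, here is a minimal Python sketch with linear or quadratic disagreement weights; the table values are made up for illustration and the function name is not from any cited software:

    import numpy as np

    def weighted_kappa(counts, kind="quadratic"):
        # Weighted kappa sketch for ordinal categories: disagreement weights
        # grow with the distance between the two assigned categories.
        q = counts.shape[0]
        i, j = np.indices((q, q))
        d = np.abs(i - j) / (q - 1)
        w = d if kind == "linear" else d ** 2        # disagreement weight matrix
        p = counts / counts.sum()                    # observed joint proportions
        e = np.outer(p.sum(axis=1), p.sum(axis=0))   # expected proportions under independence
        return 1 - (w * p).sum() / (w * e).sum()

    # Hypothetical table for 3 ordinal categories (rows: rater X, columns: rater Y)
    table = np.array([[30, 5, 1],
                      [4, 25, 6],
                      [1, 7, 21]])
    print(weighted_kappa(table), weighted_kappa(table, kind="linear"))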

Banerjee, M., Capozzoli, M., McSweeney, L. & Sinha, D. (1999). "Beyond kappa: A review of interrater agreement measures," The Canadian Journal of Statistics, Vol. 27, No. 1, pp. 3-23. https://doi.org/10.2307/3315487

Since MI(X,Y) is a measure of the stochastic dependence between X and Y, one might think of using it to measure the agreement between X and Y. Note from equation (5) that entropy, conditional entropy and mutual information are closely related. The observed agreement, p_o, is defined as the overall probability of a match between the raters' assessments and equals the sum of the elements on the main diagonal of O_XY, i.e., p_o = p_XY(1,1) + p_XY(2,2). In contrast, the expected agreement, p_e, is the overall probability of a match occurring by chance – assuming no correlation between the two raters' scores – and corresponds to the sum of the elements on the main diagonal of E_XY, i.e., p_e = p_X(1)p_Y(1) + p_X(2)p_Y(2). Cohen's kappa can be defined on the basis of these two estimators as κ = (p_o − p_e)/(1 − p_e). Shannon's entropy is defined as H(X) = −Σ_x p_X(x) log p_X(x), where the sum ranges over the set of possible values of X and q denotes the cardinality of that set. This function measures the amount of information carried by the variable X and is upper-bounded by log(q).
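The two estimators p_o and p_e can be turned into a few lines of Python; this is a sketch over an invented 2x2 table, not the authors' code:

    import numpy as np

    def cohens_kappa(counts):
        # Cohen's kappa sketch: kappa = (p_o - p_e) / (1 - p_e).
        p = counts / counts.sum()
        p_o = np.trace(p)                    # observed agreement: sum of the diagonal of O_XY
        p_e = p.sum(axis=1) @ p.sum(axis=0)  # expected agreement: sum of the diagonal of E_XY
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical 2x2 table of joint rating counts
    table = np.array([[40, 5],
                      [10, 45]])
    print(cohens_kappa(table))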

Note that Shannon's entropy is not merely one of the possible ways to achieve this goal: it is the only one that satisfies a set of basic postulates required to consistently define a measure of information [1, 30]. In our setting, we measure the agreement between two raters by modelling it as the amount of information flowing through the agreement channel (AC), a virtual channel connecting the random variables X and Y through the information path X ⇒ rating of X ⇒ condition D ⇒ rating of Y ⇒ Y (see Fig. 1). Finally, the machine learning scenario presented at the end of Section 2.5 was considered and Table 4 was produced. The Spearman rank correlation coefficients (r_s) reported in the table show that IA and κ produce different rankings for all data sets, proving that they are not strictly equivalent. However, Pearson's correlation coefficient (ρ) is close to 1, and IA and κ are therefore significantly correlated, for all data sets except DS4. Hence, in these cases the difference between the rankings induced by the two indices is due to swaps between pairs whose agreement values are close to each other, and from a qualitative point of view the two indices behave in the same way. Since the Γ_q matrix models the agreement channel in our setting, it represents the relation between X and Y and is invariant with respect to the channel input.
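The invariance of Γ_q with respect to the channel input can be checked numerically; the following Python sketch (with made-up numbers) builds the conditional-probability matrix from a joint distribution and shows that rescaling the input distribution P_X – and hence the prevalence – leaves it unchanged:

    import numpy as np

    def channel_matrix(joint):
        # Sketch of the agreement-channel matrix Gamma_q: the row-stochastic
        # matrix of conditional probabilities P(Y = j | X = i), obtained by
        # normalizing each row of the joint distribution P_XY.
        p_xy = joint / joint.sum()
        return p_xy / p_xy.sum(axis=1, keepdims=True)

    p_xy = np.array([[0.40, 0.05],
                     [0.10, 0.45]])
    # Reweighting the rows changes P_X (and the prevalence) but not Gamma_q
    reweighted = np.diag([3.0, 0.5]) @ p_xy
    print(np.allclose(channel_matrix(p_xy), channel_matrix(reweighted)))  # True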

Therefore, Γ_q is not affected by the prevalence of the condition, whose contribution is instead entirely confined to P_X. Although MI still depends on prevalence, Eq. (9) conceptually isolates the essential core of the agreement, associated with the Γ_q matrix, from the prevalence of the condition, embedded in P_X. It is important to note that we are not claiming that P_X is completely determined by prevalence; in fact, it also depends on how X partitions the sampled subjects. Since the more uniform the probability distribution of a random variable, the higher its entropy [41], Eq. (7) implies that mutual information is limited whenever X and Y fail to distribute the sampled subjects among the q classes – possibly regardless of the subjects' actual conditions – in groups of comparable cardinality. For example, if X classifies almost all the sampled subjects in the same way, whether as "having the condition" or not, then H(X) and MI(X,Y) are close to 0, even if X and Y denote the very same rater. This undesirable behaviour can be overcome by normalizing MI(X,Y) by min{H(X),H(Y)}, as illustrated in the sketch below.
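Reusing the information_agreement sketch given earlier (which already applies the 0·log 0 = 0 convention discussed below), an extreme made-up table shows the effect of the normalization: both raters put almost every subject in the same class, so H(X), H(Y) and MI(X,Y) are all close to 0, yet IA equals 1 because the raters agree perfectly:

    # Both raters label 98 of 100 subjects as class 1 and 2 as class 2:
    # H(X) = H(Y) = MI(X,Y) = 0.14 bits, but IA = MI / min{H(X), H(Y)} = 1.
    skewed = np.array([[98, 0],
                       [0,  2]])
    print(information_agreement(skewed))  # 1.0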

Table 1f presents a scenario very similar to that of Table 1e. The values of the two matrices are almost identical, except for 3 samples that are rated differently. This difference amounts to a change in only 6% of all the rating pairs (3% if we decouple the ratings of the two raters), yet it leads to a relevant increase in κ on the scale introduced in [33] and discussed above: κ goes from 0.228 (i.e., fair agreement) in Table 1e to 0.681 (i.e., substantial agreement) in Table 1f. The value of IA increases as κ does, but it only goes from 0.073 to 0.342, remaining around one third of the scale maximum – far below a value denoting substantial agreement. The goal of this work is manifold: it (i) presents a Shannon-style information-theoretic agreement index (IA) for the dichotomous and multivalued ordered-categorical cases, (ii) shows that IA conceptually generalizes Cohen's kappa, (iii) proves that IA fixes some of the shortcomings of Cohen's kappa, and finally (iv) substantiates the use of our approach in real cases by applying it to a medical data set from the literature. As for the drawbacks of the proposed approach, we must point out that IA is harder to compute than κ, because it involves logarithms, and for the same reason IA cannot be computed whenever the agreement matrix contains a 0. The first point is a minor issue and can easily be overcome by purpose-built software. As for the latter, an extension of the proposed index by continuity seems sufficient to circumvent the problem. This could be achieved, for example, by replacing all the 0s in the agreement matrix with a new variable ε and then computing IA on the new matrix as ε tends to 0 from the right (e.g., exploiting lim_{x→0+} x log x = 0), as in the sketch below.
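Continuing with the same information_agreement sketch, the continuity extension can be checked numerically on a hypothetical table containing a 0: replacing the 0 with a shrinking ε gives values that converge to the one obtained under the 0·log 0 = 0 convention:

    with_zero = np.array([[50.0, 0.0],
                          [10.0, 40.0]])
    for eps in (1e-2, 1e-4, 1e-6, 1e-8):
        perturbed = np.where(with_zero == 0, eps, with_zero)
        print(eps, information_agreement(perturbed))
    print(information_agreement(with_zero))  # limit value under the 0*log(0) = 0 convention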

Again, all these steps can easily be implemented in purpose-built software. Using equation (7), it is easy to prove that the value of IA lies in the interval [0,1]. Thus, Information Agreement retains all the information-theoretic advantages of measuring agreement through mutual information, while mitigating the concerns about the dependence of MI(X,Y) on the entropies of X and Y.
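Assuming Eq. (7) is the standard bound MI(X,Y) ≤ min{H(X),H(Y)}, the claim follows at once:

    0 ≤ MI(X,Y) ≤ min{H(X),H(Y)}  ⇒  0 ≤ IA(X,Y) = MI(X,Y) / min{H(X),H(Y)} ≤ 1,

whenever min{H(X),H(Y)} > 0.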