Since posting about the area under the ROC curve and “concordance probability” as a potential measure of treatment effect, I’ve wanted to better understand the history and current state of this idea. To what extent have statisticians investigated this measure and developed methodology around it? To what extent has the “concordance probability” been used in practice? Are there reasons why it hasn’t been more widely used?
This is a summary of literature on \(\text{P}(X > Y)\) as a measure for comparing two population distributions and quantifying treatment effect. By \(X\) and \(Y\), I mean random outcomes from the two populations being compared (e.g. \(X\) = outcome for a patient treated with an experimental treatment, and \(Y\) = outcome for a patient receiving a control treatment). I found that several related quantities have been proposed, and I include references about those.
Searching the literature for this topic has been a little difficult because there’s no single term used consistently for \(\text{P}(X > Y)\) (some authors don’t even attempt to name it). I suspect there are still important references out there that I haven’t found yet, and if so, I’ll keep adding to this post.
P(X>Y) and P(X>Y) + ½ P(X=Y)
Gini, C. (1916), “Il concetto di ‘transvariazione’ e le sue prime applicazioni,” Giornale degli Economisti e Rivista di Statistica, 53, 13–43. According to Kruskal,1 Gini proposed a quantity equivalent to \(\text{P}(X > Y) + \frac{1}{2} \text{P}(X = Y)\) which he called the probability of transvariation. I haven’t actually accessed Gini’s paper (and wouldn’t be able to read the Italian).
Ottaviani, G. (1939), “Sulla probabilità che una prova su due variabili casuali \(X\) e \(Y\) verifichi la disuguaglianza \(X<Y\) e sul corrispondente scarto quadratico medio,” Giornale dell’ Istituto Italiano degli Attuari, 10, 186–192. Again according to Kruskal,2 Ottaviani wrote about estimation of \(\text{P}(X > Y)\) using \(\frac{U}{mn}\). Here I use \(U\) to refer to the statistic that Mann and Whitney (1947) would later propose.
Wilcoxon, F. (1945), “Individual comparisons by ranking methods,” Biometrics Bulletin, 1(6), 80–83. Proposed the rank-sum test for two-sample comparison but didn’t specify the statistical hypotheses being tested or a parameter that might be estimated. Ties were handled by assigning an average rank to each outcome in a group of tied outcomes. The rank-sum test is still one of the most common non-parametric tests, and it was a basis for development of hypothesis testing and estimation for \(\text{P}(X > Y)\).
Mann, H. B., and Whitney, D. R. (1947), “On a test of whether one of two random variables is stochastically larger than the other,” Annals of Mathematical Statistics, 18(1), 50–60. Proposed specific hypotheses to be tested by a Wilcoxon-inspired, rank-based test: a null hypothesis of equal cdfs between two populations, and an alternative that one random variable is stochastically larger than the other (e.g. \(X\) stochastically larger than \(Y\), meaning \(F_X(z) < F_Y(z)\) for all \(z\)). The test statistic was \(U\), the number of pairs \((x_i, y_j)\) for which \(x_i > y_j\) among all pairs of an outcome from one sample (\(x_i\)) with an outcome from the other (\(y_j\)). It was assumed that no ties occur. A simple formula relates \(U\) to Wilcoxon’s rank-sum statistic. Still no parameter was identified — the only interest was in hypothesis testing.
Birnbaum, Z. W. (1956), “On a use of the Mann-Whitney statistic,” Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1, 13–17. Gave attention to \(p = \text{P}(X > Y)\) as a parameter one might want to estimate — for example, in evaluating the reliability of mechanical components. Considered estimation of \(p\) using \(\hat{p} = \frac{U}{mn}\), the Mann-Whitney statistic divided by the number of pairs in the sample, and construction of confidence intervals when one can’t rely on large-sample normality.
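To make these quantities concrete, here's a quick sketch in Python (with made-up numbers): \(U\) by direct pair counting, its relation to Wilcoxon's rank-sum statistic when there are no ties, and Birnbaum's estimate \(\hat{p} = \frac{U}{mn}\).

```python
# Minimal sketch with made-up data: U counts the pairs (x_i, y_j) with
# x_i > y_j; with no ties it relates to Wilcoxon's rank-sum statistic W
# (sum of the ranks of the x-sample in the combined sample) by
# U = W - m(m+1)/2, and U / (m*n) estimates P(X > Y).
x = [7.1, 3.4, 9.8, 6.2]   # sample from population X (m = 4)
y = [2.5, 5.0, 4.1]        # sample from population Y (n = 3)
m, n = len(x), len(y)

U = sum(1 for xi in x for yj in y if xi > yj)   # direct pair count

rank = {v: i + 1 for i, v in enumerate(sorted(x + y))}  # ranks start at 1; assumes no ties
W = sum(rank[xi] for xi in x)
assert U == W - m * (m + 1) // 2

p_hat = U / (m * n)   # Birnbaum's estimate of P(X > Y)
print(U, p_hat)
```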
Klotz, J. H. (1966), “The Wilcoxon, ties, and the computer,” Journal of the American Statistical Association, 61(315), 772–787. Suggested \(\text{P}(X > Y) + \frac{1}{2} \text{P}(X = Y)\) as a reasonable measure of the degree of separation of two distributions, in the context of discussing the null distribution for the Wilcoxon (Mann-Whitney) test.
Wolfe, D. A. and Hogg, R. V. (1971), “On constructing statistics and reporting data,” The American Statistician, 25(4), 27–30. Suggested that \(\text{P}(X > Y)\) is a measure for comparing two distributions that is widely applicable and more immediately interpretable than, for example, \((\mu_2 - \mu_1) / \sigma\). Showed how \(\text{P}(X > Y)\) can be a basis for common statistical comparisons and tests, both parametric and non-parametric.
Hand, D. J. (1992), “On comparing two treatments,” The American Statistician, 46(3), 190–192. Noted that 1) the probability that an individual will have a better outcome from treatment A than from treatment B, \(\text{P}(z_A - z_B > 0)\), is a sensible measure of treatment effect, and 2) in some situations it is feasible only to estimate the probability that an individual treated with A will have a better outcome than an independently sampled individual treated with B, \(\text{P}(x_A - y_B > 0)\). Considered when these two parameters can be qualitatively opposite (one indicating treatment benefit, the other indicating harm).
McGraw, K. O. and Wong, S. P. (1992), “A common language effect size statistic,” Psychological Bulletin, 111(2), 361–365. Proposed using \(\text{P}(X > Y)\), which they term CL, as a measure of effect size, writing that it “is so readily understood by nonstatisticians that we have chosen to call it the common language effect size indicator.” Echoing Wolfe and Hogg (1971), they considered CL more readily interpretable than Cohen’s d (a standardized mean difference) and several other effect size indicators.
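As a quick illustration (my own numbers, not from the paper): for independent normal outcomes, CL has a closed form, and with equal variances it's a monotone transform of Cohen's d, namely CL = \(\Phi(d/\sqrt{2})\).

```python
# Sketch with illustrative numbers: under a normal model, CL = P(X > Y)
# has a closed form, since X - Y ~ N(mu_x - mu_y, sd_x^2 + sd_y^2);
# with equal variances this reduces to Phi(d / sqrt(2)) for Cohen's d.
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def cl_normal(mu_x, mu_y, sd_x, sd_y):
    # P(X > Y) for independent normals
    return norm_cdf((mu_x - mu_y) / sqrt(sd_x**2 + sd_y**2))

d = 0.8                                # a "large" Cohen's d
cl = cl_normal(d, 0.0, 1.0, 1.0)       # equal unit variances
assert abs(cl - norm_cdf(d / sqrt(2))) < 1e-12
print(round(cl, 3))
```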
Grissom, R. J. (1994), “Probability of the superior outcome of one treatment over another,” Journal of Applied Psychology, 79(2), 314–316. Cited Wolfe and Hogg (1971), and McGraw and Wong (1992), but suggested calling \(\text{P}(X > Y)\) the probability of superiority (PS).
Hauck, W. W., Hyslop, T., and Anderson, S. (2000), “Generalized treatment effects for clinical trials,” Statistics in Medicine, 19, 887–899. Drawing inspiration from a paper3 about improving two-sample comparisons, suggested \(\text{P}(X > Y)\) specifically as the target parameter for assessing treatment effects in clinical trials. Noted that it’s easily understood by clinical, non-statistical colleagues.
Vargha, A., and Delaney, H. D. (2000), “A critique and improvement of the ‘CL’ common language effect size statistics of McGraw and Wong,” Journal of Educational and Behavioral Statistics, 25(2), 101–132. Commenting on McGraw and Wong’s (1992) proposal of CL = \(\text{P}(X > Y)\) as an effect size metric, Vargha and Delaney suggested \(\text{P}(X > Y) + \frac{1}{2} \text{P}(X = Y)\) instead, denoting it A and naming it the measure of stochastic superiority. They wrote, “If [the outcome] is discrete, A can be interpreted as an estimate of the value of CL that would be obtained if the distribution of [the outcome] were continuous.”
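Here's a small sketch (hypothetical ordinal scores) of the sample version of A, counting cross-sample ties as one half:

```python
# Sketch with hypothetical ordinal scores: the sample analogue of
# A = P(X > Y) + 0.5 * P(X = Y), computed over all m*n cross-sample
# pairs, with ties counted as one half.
x = [3, 3, 2, 1, 3]   # e.g. ordinal scores under treatment
y = [1, 2, 2, 3]      # scores under control
mn = len(x) * len(y)

wins = sum(1 for xi in x for yj in y if xi > yj)
ties = sum(1 for xi in x for yj in y if xi == yj)
A = (wins + 0.5 * ties) / mn
print(A)
```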
Acion, L., Peterson, J. J., Temple, S., and Arndt, S. (2006), “Probabilistic index: an intuitive non-parametric approach to measuring the size of treatment effects,” Statistics in Medicine, 25, 591–602. Yet another proposal to use the “probabilistic index” \(\text{P}(X > Y)\) as an intuitive measure of treatment effect. Described calculating \(\hat{\text{P}}(X > Y)\) in a manner that actually would estimate \(\text{P}(X > Y) + \frac{1}{2} \text{P}(X = Y)\).
Senn, S. (2006), “Probabilistic index: an intuitive non-parametric approach to measuring the size of treatment effects” (Letter to the Editor), Statistics in Medicine, 25, 3944–3946. Argued that \(\text{P}(X > Y)\) and related measures of treatment effect are difficult for a clinician to understand and explain, with \(\text{P}(X > Y)\) being easily mistaken for the probability of benefit; and are not “robust” or clinically meaningful because they depend on “nuisance parameters” such as variance and covariance, and on measurement error. In particular, argued that outcome variance observed in a clinical trial is unlikely to represent the variance in the target population. These arguments were given again, more fully, elsewhere.4 This is the only objection to \(\text{P}(X > Y)\) as a measure of treatment effect that I’ve seen.
Agresti, A. (2010), Analysis of Ordinal Categorical Data (2nd ed.), Hoboken, NJ: John Wiley and Sons. Used \(\alpha\) to denote the “stochastic superiority” measure \(\alpha = \text{P}(X > Y) + \frac{1}{2} \text{P}(X = Y)\) in the setting of an ordinal outcome compared between two groups, giving credit to Vargha and Delaney5 for the term stochastic superiority (p. 13–14). Cited other non-parametric statistics literature that uses \(\alpha\) (p. 41).
Kieser, M., Friede, T., and Gondan, M. (2013), “Assessment of statistical significance and clinical relevance,” Statistics in Medicine, 32, 1707–1719. In the context of bringing attention to the need for assessing not only statistical significance but also clinical significance in clinical trials, and noting problems with “responder analysis” which uses a threshold for defining a desirable outcome, recommended the “relative effect” \(\theta = \text{P}(X > Y) + \frac{1}{2} \text{P}(X = Y)\).
Demidenko, E. (2016), “The \(p\)-value you can’t buy,” The American Statistician, 70(1), 33–38. Reviewed some criticism of \(p\)-values and — stating that “the root of the problem with the \(p\)-value is the group mean comparison” — recommended the “\(D\)-value” (or its complement, the “\(B\)-value”), the empirical estimate of \(\delta = \text{P}(X > Y)\). Greenland et al.6 published a strongly worded criticism of Demidenko, primarily because of his statement that “the \(D\)-value is the proportion of patients who got worse after the treatment.” Greenland et al. interpret this as Demidenko mistaking \(\text{P}(X > Y)\) for \(\text{P}[Y(1) > Y(0)]\) (or assuming that they were equal). To me, it’s not clear what Demidenko meant, or why he phrased his statement that way — and the criticism seems a little harsh. The phrase “patients who got worse” is strange because it seems to refer to the patient’s state improving or declining over time, which involves neither comparison with another patient nor comparison with this same patient’s counterfactual outcome under a comparison treatment. Meanwhile, Demidenko correctly describes the meaning of what he calls the \(D\)-value elsewhere in the paper.
Fay, M. P., Brittain, E. H., Shih, J. H., Follmann, D. A., and Gabriel, E. E. (2018), “Causal estimands and confidence intervals associated with Wilcoxon‐Mann‐Whitney tests in randomized experiments,” Statistics in Medicine, 37, 2923–2937. Denoted the “Mann-Whitney parameter” \(\phi = \text{P}(X > Y) + \frac{1}{2}\text{P}(X = Y)\) and contrasted it with \(\psi = \text{P}[Y(1) > Y(0)] + \frac{1}{2}\text{P}[Y(1) = Y(0)]\), where \(Y(t)\) is the potential outcome for a patient given treatment \(t\). Reviewed “Hand’s paradox” (1992) in which \(\phi > \frac{1}{2}\) but \(\psi < \frac{1}{2}\) or vice versa, and described some results about bounding \(\psi\).
P(X>Y) – P(Y>X) and [P(X>Y) – P(Y>X)] / P(X≠Y)
Deuchler, G. (1914), “Über die Methoden der Korrelationsrechnung in der Pädagogik und Psychologie,” Zeitschrift für Pädagogische Psychologie und Experimentelle Pädagogik, 15, 114–131, 145–159, 229–242. According to Kruskal,7 Deuchler suggested a method for comparing two samples that was essentially equivalent to the Wilcoxon rank-sum test. The test statistic could be considered a sample estimate of \(\frac{\text{P}(X>Y) - \text{P}(Y>X)}{\text{P}(X \ne Y)}\).
Kendall, M. G. (1938), “A new measure of rank correlation,” Biometrika, 30(1/2), 81–93. Proposed the correlation metric \(\tau\) which was a predecessor of — and is similar to — Goodman and Kruskal’s \(\gamma\) (1954).
Goodman, L. A., and Kruskal, W. H. (1954), “Measures of association for cross classifications,” Journal of the American Statistical Association, 49(268), 732–764. Proposed a measure of association between two ordinal variables, \(\gamma = \frac{\Pi_s - \Pi_d}{1 - \Pi_t}\). \(\Pi_s\) is the probability that, in a random pair of individuals from a population, the comparison of measurements of one variable will agree with the comparison of measurements of the other variable; in other words, either individual 1 has the greater value on both variables, or individual 2 does. \(\Pi_d\) is the probability that the comparison on one variable disagrees with the comparison on the other; \(\Pi_t\) is the probability of a tie on either variable. The same measure \(\gamma\) could be used to compare two population distributions by measuring the “correlation” of membership in one population (vs the other) with an ordinal outcome variable. Allowing the outcome to be continuous is another simple extension.
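As a sketch (hypothetical data): for a binary group indicator paired with an ordinal outcome, \(\gamma\) reduces to a ratio of concordant minus discordant pairs over concordant plus discordant pairs, since \(1 - \Pi_t = \Pi_s + \Pi_d\).

```python
# Sketch with hypothetical data: Goodman-Kruskal's gamma between a binary
# group indicator and an ordinal outcome, via concordant/discordant pairs.
# Pairs tied on either variable drop out, matching (C - D) / (C + D).
group = [0, 0, 0, 1, 1, 1]     # 0 = control, 1 = treatment
score = [1, 2, 3, 2, 3, 3]     # ordinal outcome

conc = disc = 0
n = len(group)
for i in range(n):
    for j in range(i + 1, n):
        dg = group[i] - group[j]
        ds = score[i] - score[j]
        if dg * ds > 0:        # comparisons agree
            conc += 1
        elif dg * ds < 0:      # comparisons disagree
            disc += 1          # (ties on either variable contribute nothing)
gamma = (conc - disc) / (conc + disc)
print(round(gamma, 3))
```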
Simonoff, J. S., Hochberg, Y., and Reiser, B. (1986), “Alternative estimation procedures for Pr(X < Y) in categorized data,” Biometrics, 42, 895–907. Considered \(\text{P}(X > Y)\) to be an important estimand, but suggested estimating \(\lambda = \text{P}(X > Y) - \text{P}(Y > X)\) instead when \(\text{P}(X = Y) > 0\).
Cliff, N. (1993), “Dominance statistics: Ordinal analyses to answer ordinal questions,” Psychological Bulletin, 114(3), 494–509. Argued for use of ordinal methods in behavioral science, and suggested focusing on \(\delta = \text{P}(X > Y) - \text{P}(Y > X)\). Considered \(\delta\) preferable to \(\text{P}(X > Y)\) because of the possibility of ties.
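A quick sketch (hypothetical scores) of the sample version of Cliff's \(\delta\), which also checks the identity \(\delta = 2A - 1\) relating it to Vargha and Delaney's A:

```python
# Sketch with hypothetical scores: Cliff's delta, the sample analogue of
# P(X > Y) - P(Y > X). Because wins + losses + ties = m*n, it satisfies
# delta = 2A - 1, where A estimates P(X > Y) + 0.5 * P(X = Y).
x = [3, 3, 2, 1, 3]
y = [1, 2, 2, 3]
mn = len(x) * len(y)

wins   = sum(1 for xi in x for yj in y if xi > yj)
losses = sum(1 for xi in x for yj in y if xi < yj)
ties   = mn - wins - losses

delta = (wins - losses) / mn
A = (wins + 0.5 * ties) / mn
assert abs(delta - (2 * A - 1)) < 1e-12
print(delta)
```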
Buyse, M. (2010), “Generalized pairwise comparisons of prioritized outcomes in the two-sample problem,” Statistics in Medicine, 29, 3245–3257. Considered the “proportion in favor of treatment” \(\Delta = \text{P}(X > Y) - \text{P}(Y > X)\) (conditional on \(X \ne Y\), though this wasn’t stated clearly) as a measure of treatment effect. For different types of univariate outcome, showed how testing for \(\Delta\) relates to standard statistical tests. Proposed using \(\Delta\) to compare several outcome variables simultaneously by prioritizing the outcomes, and thereby defining whether each possible comparison of outcomes is favorable, unfavorable, or neutral.
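Here's a simplified sketch (hypothetical data, my own scoring convention) of a generalized pairwise comparison with two prioritized outcomes: the second outcome is consulted only when the first-priority comparison is a tie, and the proportion in favor of treatment is computed over all pairs.

```python
# Simplified sketch with hypothetical data: Buyse-style generalized
# pairwise comparison with two prioritized outcomes per patient.
# Each treatment-control pair is classified on the first outcome;
# a tie there passes the decision to the second outcome.
treated = [(3, 10.0), (2, 8.0), (3, 6.0)]   # (priority-1 score, priority-2 score)
control = [(1, 9.0), (3, 5.0)]

favorable = unfavorable = 0
for t in treated:
    for c in control:
        for tv, cv in zip(t, c):     # walk outcomes in priority order
            if tv > cv:
                favorable += 1
                break
            if tv < cv:
                unfavorable += 1
                break
        # a pair tying on every outcome stays neutral

n_pairs = len(treated) * len(control)
delta = (favorable - unfavorable) / n_pairs   # proportion in favor of treatment
print(round(delta, 3))
```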
Connection to area under the ROC (above the OD) curve
Bamber, D. (1975), “The area above the ordinal dominance graph and the area below the receiver operating characteristic graph,” Journal of Mathematical Psychology, 12, 387–415. Described how \(\text{P}(X > Y) + \frac{1}{2}\text{P}(X = Y)\) is equal to the area above the ordinal dominance (OD) graph, which plots \(\text{P}(X \leq c)\) vs \(\text{P}(Y \leq c)\), and the area under the ROC graph, which plots \(\text{P}(X > c)\) vs \(\text{P}(Y > c)\), assuming one uses the trapezoidal method to compute area.
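This equivalence is easy to check numerically (hypothetical data): the trapezoidal area under the empirical ROC curve equals the cross-sample estimate of \(\text{P}(X > Y) + \frac{1}{2}\text{P}(X = Y)\).

```python
# Sketch with hypothetical data: the trapezoidal area under the empirical
# ROC curve, which plots P(X > c) against P(Y > c) over thresholds c,
# equals the pairwise estimate of P(X > Y) + 0.5 * P(X = Y).
x = [3, 3, 2, 1, 3]
y = [1, 2, 2, 3]

thresholds = sorted(set(x + y))
# ROC points: (1, 1) for c below the minimum, down to (0, 0) at the maximum
pts = [(1.0, 1.0)] + [
    (sum(yj > c for yj in y) / len(y), sum(xi > c for xi in x) / len(x))
    for c in thresholds
]
pts.sort()  # order by horizontal coordinate for the trapezoid sum
auc = sum((f1 - f0) * (t0 + t1) / 2
          for (f0, t0), (f1, t1) in zip(pts, pts[1:]))

wins = sum(1 for xi in x for yj in y if xi > yj)
ties = sum(1 for xi in x for yj in y if xi == yj)
assert abs(auc - (wins + 0.5 * ties) / (len(x) * len(y))) < 1e-12
print(round(auc, 2))
```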
Estimation
Point estimates of \(\text{P}(X > Y)\) and related estimands are generally straightforward. There is a substantial literature on computing confidence intervals, which I don’t try to cover here. One interesting estimation problem arises when the outcomes \(X\) and \(Y\) are censored, as with survival outcomes.
Efron, B. (1967), “The two-sample problem with censored data,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 4, 831–853. For the problem of performing a generalized Wilcoxon rank-sum test with censored data, reviewed the proposals of Gehan (1965) and Gilbert (1962). Proposed an alternative, superior test statistic based on (essentially) Kaplan-Meier estimates of the two population cdfs. The test statistic represents an estimate of \(\text{P}(X \geq Y)\).
Adjusting for covariates
These two references are examples — I didn’t systematically search for related methods.
Brumback, L. C., Pepe, M. S., and Alonzo, T. A. (2006), “Using the ROC curve for gauging treatment effect in clinical trials,” Statistics in Medicine, 25, 575–590. Adapted a method for ROC curve regression to provide a method for evaluating covariate-adjusted treatment effect in terms of a conditional \(\text{P}(X > Y)\). The regression model was fit on pairs of individuals, and the interest was in conditioning on the individuals having the same covariate values.
Thas, O., De Neve, J., Clement, L., and Ottoy, J. P. (2012), “Probabilistic index models,” Journal of the Royal Statistical Society: Series B, 74, 623–671. Proposed a more general regression framework which extended the method of Brumback, Pepe, and Alonzo (2006).
Kruskal, W. H. (1957), “Historical notes on the Wilcoxon unpaired two-sample test,” Journal of the American Statistical Association 52(279), 356–360.↩︎
Ibid.↩︎
O’Brien, P. C. (1988), “Comparing two samples: Extensions of the t, rank-sum, and log-rank tests,” Journal of the American Statistical Association, 83(401), 52–61.↩︎
Senn, S. (2011), “U is for unease: Reasons for mistrusting overlap measures for reporting clinical trials,” Statistics in Biopharmaceutical Research, 3(2), 302–309.↩︎
Vargha, A., and Delaney, H. D. (1998), “The Kruskal-Wallis test and stochastic homogeneity,” Journal of Educational and Behavioral Statistics, 23(2), 170–192.↩︎
Greenland, S., Fay, M. P., Brittain, E. H., Shih, J. H., Follmann, D. A., Gabriel, E. E., and Robins, J. M. (2020), “On causal inferences for personalized medicine: How hidden causal assumptions led to erroneous causal claims about the D-value,” The American Statistician, 74(3), 243–248.↩︎
Kruskal, W. H. (1957).↩︎