Concordance probability as a measure of treatment effect

The area under the ROC curve (AUROC) is a familiar metric for how accurately a measurement or risk score discriminates between, say, individuals with and without a particular disease. My previous post describes how AUROC is equal to the concordance probability:

The AUROC is the probability that, in a random pairing of a disease patient with a non-disease patient, the disease patient has the higher score.

In this view, the distribution of a variable with a continuous (or at least ordinal) value — the score — is serving as a predictor, and a membership in one of two populations — disease or non-disease — is serving as a kind of dichotomous outcome.

However, the concordance probability is simply an indicator of how one distribution compares to another — how disease patient scores compare to non-disease patient scores. It’s very general: Given X drawn from one population, and Y drawn independently from another, how often is X greater than Y?

I wondered about the concordance probability being used with the roles of predictor and outcome reversed: Suppose we want an indicator of how a continuous or ordinal outcome compares between two populations, with membership in (or assignment to) one of the populations serving as a kind of predictor. Suppose — suppose! — we use concordance probability as a standardized metric for the effect of one treatment compared to another on an outcome?

I shouldn’t have been surprised when I found that Margaret Pepe’s book,¹ which was so helpful for my previous post, touches on this use of AUROC:

It has been noted that [ROC curves] can also be useful in applications outside of diagnostic testing. In essence, the ROC curve is a statistical tool that describes the separation between two distributions, \(S_D\) and \(S_{\overline{D}}\), and so it can be useful in any setting where separations between distributions are of interest. For example, such a description can be valuable for evaluating treatment effects on an outcome in a clinical trial or for evaluating risk factors for disease.²

With a brief literature search, I discovered that this is not just a cute idea among statisticians. The idea has been raised, at a minimum, in both psychology and psychopharmacology, with variations in the terminology.

In psychology, McGraw and Wong³ proposed the concordance probability as a way to state effect size, calling it the common language effect size, or CL. They write that

The primary value of the proposed statistic is that it is better than the available alternatives for communicating effect size to audiences untutored in statistics.

Hence, “common language” in the name. The alternatives they mention are Cohen’s d and several others I wasn’t familiar with. Indeed, one reason why I thought about the AUROC as potentially useful beyond diagnostic or predictive accuracy is that it seems intuitive and easy to explain to non-statisticians.

Vargha and Delaney⁴ suggested a sensible modification which they denote A. Whereas CL is the probability that a randomly sampled score or outcome from one population is strictly greater than a randomly sampled score or outcome from a comparison distribution, A is that same probability plus half the probability of a tie. A benefit of counting ties as “half wins” is that if A(i, j) represents the value of A for population i compared to population j, then A(1, 2) = 1 - A(2, 1), a kind of symmetry that doesn’t happen with CL if ties have non-zero probability. A also corresponds to a calculation of the area under the ROC curve using the trapezoidal rule — that is, using a straight line to interpolate between consecutive points and calculating area under the resulting piece-wise linear “curve.”

In psychopharmacology, Faraone et al.⁵ proposed the drug-response curve, which is the ROC curve by another name, and the area under the curve as tools to help “clarify the clinical relevance of drug-placebo differences that have already been shown to be statistically significant.” In their perspective, the curve represents a plot of response rate in the experimental drug group versus response rate in the placebo group, where “response” means having an outcome above a certain threshold. The curve is drawn by varying the threshold. This is exactly the same construction as an ROC curve, but I hadn’t thought of the coordinates in terms of response rates.

After this bit of reading, I have a few questions that I want to investigate further:

What are desirable characteristics of a standardized metric of treatment effect? And in terms of these characteristics, how does the concordance probability compare to alternatives?
Are there situations in clinical research, or elsewhere, in which the concordance probability is clearly inferior as a measure of effect? Where it indicates that effect A compares to effect B in a manner which seems clearly incorrect?
When the outcome is a censored time to event, what estimates of the concordance probability are possible?

Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.↩︎
Ibid, p. 74.↩︎
McGraw, K. O., and Wong, S. P. (1992). “A Common Language Effect Size Statistic.” Psychological Bulletin 111(2), 361–365.↩︎
Vargha, A., and Delaney, H. D. (2000). “A Critique and Improvement of the ‘CL’ Common Language Effect Size Statistics of McGraw and Wong.” Journal of Educational and Behavioral Statistics 25(2), 101–132.↩︎
Faraone, S. V., Biederman, J., Spencer, T. J., and Wilens, T. E. (2000). “The Drug-Placebo Response Curve: A New Method for Assessing Drug Effects in Clinical Trials.” Journal of Clinical Psychopharmacology 20(6), 673–679.↩︎