
A puzzle about two-sample comparisons

I’ve been reading about probabilistic index models (PIMs).1 This is a type of regression model that assesses the dependence of an outcome $Y$ on covariates (or group membership) based on the “probabilistic index,” $P(Y < Y' \mid X, X') + \tfrac{1}{2}P(Y = Y' \mid X, X')$, where $(X, Y)$ and $(X', Y')$ denote independent covariate-outcome pairs.

An interesting thing about these models is that for a sample of size $n$, the model fitting procedure (unless one customizes it) uses all $n(n-1)/2$ pairwise comparisons of the covariates and outcomes. If your data set looks like

x y
1 -0.47
2 -0.14
3 -1.61
4 -1.37

then the model is fit to a data set like

x1 y1 x2 y2
1 -0.47 2 -0.14
1 -0.47 3 -1.61
1 -0.47 4 -1.37
2 -0.14 3 -1.61
2 -0.14 4 -1.37
3 -1.61 4 -1.37

where each row in the original data is compared with each other row.
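
To make the expansion concrete, here is a minimal sketch in Python with pandas (my choice of tools for illustration, not what the PIM software actually does internally) that builds the paired data set from the small example above:

```python
import itertools

import pandas as pd

# The small example data set from above
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [-0.47, -0.14, -1.61, -1.37]})

# Expand into all n(n-1)/2 pairwise comparisons of rows
pairs = pd.DataFrame(
    [
        {"x1": df.x[i], "y1": df.y[i], "x2": df.x[j], "y2": df.y[j]}
        for i, j in itertools.combinations(df.index, 2)
    ]
)
print(pairs)  # 6 rows, matching the paired data set shown above
```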

This is a problem when n is large. Fitting the model can be cumbersome or just infeasible. Why does this model require all pairwise comparisons, whereas other regression models don’t?

One answer might be: a PIM is inherently a model of an outcome resulting from pairwise comparisons, whereas a linear regression model or GLM is inherently a model of an individual’s outcome.

But suppose one formed a regression model for $E(Y - Y' \mid X, X')$, where $X$ represents covariates associated with $Y$, and $X'$ contains covariates for $Y'$. This would be a model for pairwise comparisons of outcomes, yet the coefficients in

$$E(Y - Y' \mid X, X') = \beta_1(X_1 - X_1') + \beta_2(X_2 - X_2') + \dots$$

are exactly the same as those in the linear regression model

$$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$$

which can be fit using just the $n$ observations; still no need for the $n(n-1)/2$ pairs.
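
Here is a quick numerical check of that claim, a sketch using numpy with made-up data and coefficients: the slopes estimated from regressing all pairwise outcome differences on pairwise covariate differences match the slopes from ordinary least squares on the original $n$ observations.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))                     # two covariates (made up)
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Ordinary least squares on the n observations (with an intercept)
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0]

# Least squares on all n(n-1)/2 pairwise differences (no intercept: it cancels)
idx = list(itertools.combinations(range(n), 2))
Xd = np.array([X[i] - X[j] for i, j in idx])
yd = np.array([y[i] - y[j] for i, j in idx])
beta_pairs = np.linalg.lstsq(Xd, yd, rcond=None)[0]

print(beta_ols[1:])  # slope estimates from the usual regression
print(beta_pairs)    # the same slopes, recovered from the pairwise version
```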

Two-sample comparisons

What about a simpler setting, comparing an outcome variable between two populations with no consideration of covariates?

Suppose I independently sample $n_0$ individuals from one population and $n_1$ individuals from another. For each individual, I observe outcome $Y$. Assume $Y$ is continuous. I’ll use $Y_0$ and $Y_1$ to denote random outcomes from the two populations, sampled independently.

To make a statistical comparison between the two populations, the conventional options are

  1. A t-test, which can be considered a parametric test about the value of $E(Y_1) - E(Y_0)$, assuming $Y_1$ and $Y_0$ are each normally distributed (though it works as a test about $E(Y_1) - E(Y_0)$ in many non-normal settings).
  2. A Wilcoxon (Mann-Whitney) test, which can be considered a non-parametric test about the value of $P(Y_1 > Y_0)$ (plus $\tfrac{1}{2}P(Y_1 = Y_0)$ if ties are possible).

I’ve always thought of option #1 as first summarizing each population with its mean (expected value), then comparing the means. Another way to think about it, which I didn’t consider until recently, is to imagine sampling a random pair of outcomes $(Y_0, Y_1)$, comparing them by taking the difference $Y_1 - Y_0$, and looking at the mean over all such comparisons, $E(Y_1 - Y_0)$.

Of course, $E(Y_1 - Y_0)$ is mathematically the same as $E(Y_1) - E(Y_0)$. The distinction becomes useful when thinking about an alternative comparison based on $P(Y_1 > Y_0)$.

For the Mann-Whitney test statistic, or for the sample estimate of $P(Y_1 > Y_0)$, the computation essentially involves comparing each value of $Y$ in one sample to each value of $Y$ in the other. The M-W statistic is the number of pairs $(y_{0i}, y_{1j})$ for which $y_{0i} < y_{1j}$ (I’m assuming no ties). The sample estimate of $P(Y_1 > Y_0)$ is the mean of

$$c(y_{1j}, y_{0i}) = \begin{cases} 1 & \text{if } y_{1j} > y_{0i} \\ 0 & \text{if } y_{1j} < y_{0i} \end{cases}$$

over all $n_0 \times n_1$ pairs. Thus, there’s a natural connection between the concept of $P(Y_1 > Y_0)$ as a mean over all pairwise comparisons between the two populations, and the actual computation for inference about $P(Y_1 > Y_0)$ from a sample.
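
For example, in this sketch with simulated data (scipy’s mannwhitneyu is used only as a cross-check, assuming a reasonably recent scipy), both the Mann-Whitney statistic and the estimate of $P(Y_1 > Y_0)$ come from the same $n_0 \times n_1$ grid of comparisons:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
y0 = rng.normal(loc=0.0, size=30)       # sample from population 0
y1 = rng.normal(loc=0.5, size=40)       # sample from population 1

wins = y1[:, None] > y0[None, :]        # all n1 x n0 pairwise comparisons
U = wins.sum()                          # Mann-Whitney statistic (no ties here)
phat = wins.mean()                      # sample estimate of P(Y1 > Y0)

print(U, mannwhitneyu(y1, y0).statistic)  # same count, via scipy
print(phat)                               # equals U / (n0 * n1)
```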

In the case of $E(Y_1 - Y_0)$, the computation doesn’t require looking at all $n_0 \times n_1$ differences $y_{1j} - y_{0i}$. Primarily, the information about $E(Y_1 - Y_0)$ boils down to the difference in sample means, $\bar{y}_1 - \bar{y}_0$. Suppose $n_0 = n_1 = n$. Then

$$\bar{y}_1 - \bar{y}_0 = \frac{1}{n}\sum_{i=1}^{n} y_{1i} - \frac{1}{n}\sum_{i=1}^{n} y_{0i} = \frac{1}{n}\sum_{i=1}^{n} (y_{1i} - y_{0i})$$

which involves only $n$ of the $y_{1j} - y_{0i}$ differences. A similar simplification is possible when $n_0 \ne n_1$. Whether $n_0 = n_1$ or not, it seems that the number of arithmetic comparisons needed to extract all the information about $E(Y_1 - Y_0)$ from the sample will be roughly proportional to $n_0 + n_1$, not $n_0 \times n_1$.
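
A quick check with simulated data (again just a sketch) confirms that the mean over all $n_0 \times n_1$ pairwise differences equals $\bar{y}_1 - \bar{y}_0$ exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
y0 = rng.normal(size=25)
y1 = rng.normal(loc=1.0, size=35)

all_pair_mean = (y1[:, None] - y0[None, :]).mean()  # O(n0 * n1) work
mean_diff = y1.mean() - y0.mean()                   # O(n0 + n1) work

print(all_pair_mean, mean_diff)  # identical up to floating-point error
```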

Looking closer

Although this is starting to sound like a matter of algorithm analysis, a big-O classification of the calculations for the two-sample mean comparison versus the two-sample Mann-Whitney comparison, I’m not just interested in the computational steps needed for specific estimates or test statistics. I’m trying to understand at an intuitive level why one population comparison requires more pairwise comparisons than the other.

Here’s one attempt:

Suppose that Y is the variable I want to compare, and X indicates whether an individual belongs to one population (x=1) or the other (x=0).

Suppose I sample one individual at a time and observe $(x_i, y_i)$. With each new individual, I consider which comparisons between it and the previously sampled individuals are informative about $\delta = E(Y_1 - Y_0)$ and $\phi = P(Y_1 > Y_0)$, where $Y_1$ and $Y_0$ are independently sampled from the populations with $X = 1$ and $X = 0$, respectively.

When $\delta$ is of interest, I’ll consider the $x_i$ vs $x_j$ comparison and $y_i - y_j$ as potential input for an algorithm that will estimate $\delta$ in some kind of optimal way, and I’ll ask: do I expect that adding this input could potentially change the estimate?

On the other hand, if $\phi$ is of interest, my intuition tells me that, whatever intermediate computations are done, when it comes to actually estimating $\phi$, the only information from each pair $(y_i, y_j)$ that the algorithm will incorporate into the estimate is whether $y_i > y_j$. (Why? I don’t know; just intuition.) So when $\phi$ is of interest, I’ll compute $c(y_i, y_j)$ (see the definition above), and I’ll ask whether each $c(y_i, y_j)$ should be added to the input for an algorithm that estimates $\phi$.

Part 1

Start with a setting where I’m estimating $\delta$. Sample $(x_1, y_1)$, then $(x_2, y_2)$. Let’s assume $x_1 = 0$ and $x_2 = 1$. Obviously the comparison of these two observations is informative for $\delta$.

$i$ vs $j$    $x_i$ vs $x_j$    Is $y_i - y_j$ informative for $\delta$?
2 vs 1        1 vs 0            Yes

Next I sample $(x_3, y_3)$. Now consider how $(x_1, y_1)$ and $(x_2, y_2)$ each compares to $(x_3, y_3)$. Let’s say $x_3 = 1$. Since $y_3$ is new information about population 1, $y_3 - y_1$ seems clearly informative for $\delta$. Let’s add it to the input for the algorithm. But $y_3 - y_2$? Since $y_3 - y_2 = (y_3 - y_1) - (y_2 - y_1)$, the information that $y_3 - y_2$ represents is already in the rest of the data. There shouldn’t be any reason to provide that value.

$i$ vs $j$    $x_i$ vs $x_j$    Is $y_i - y_j$ informative for $\delta$?
2 vs 1        1 vs 0            Yes
3 vs 1        1 vs 0            Yes
3 vs 2        1 vs 1            No

Continue by sampling $(x_4, y_4)$, and let’s say $x_4 = 0$. The difference $y_4 - y_1$ is a difference within group 0, which could affect the estimate of $E(Y \mid X = 0)$ and thereby affect the estimate of $\delta$, so I include it in the input. However, $y_4 - y_2 = (y_4 - y_1) - (y_2 - y_1)$ and $y_4 - y_3 = (y_4 - y_1) - (y_3 - y_1)$, so those are redundant.

$i$ vs $j$    $x_i$ vs $x_j$    Is $y_i - y_j$ informative for $\delta$?
2 vs 1        1 vs 0            Yes
3 vs 1        1 vs 0            Yes
3 vs 2        1 vs 1            No
4 vs 1        0 vs 0            Yes
4 vs 2        0 vs 1            No
4 vs 3        0 vs 1            No

For each subsequent sample, all that the algorithm would need to know is the difference between the new $y_i$ and $y_1$. All other differences are redundant. Therefore, after $n$ samples, there will be $n - 1$ informative differences; in other words, $n - 1$ degrees of freedom.

Part 2

What if I’m interested in ϕ instead?

My intuition here is that what matters for ϕ is the order of the yi values. This might be obvious. The Wilcoxon test, which is based on ranking the combined sample, clearly returns the same result as long as the order of the outcome values is the same.

More specifically, what matters is where outcomes from group $x = 0$ and group $x = 1$ appear in the sequence of outcomes. Suppose I assign indices $i$ so that $y_1 < y_2 < \dots < y_n$. Then I look at $x_1, x_2, \dots, x_n$, which is a sequence of 0s and 1s. What matters for estimating $\phi$ is that there are, say, three 0s, then one 1, then two 0s, etc., in the sequence of $x$s. Any other set of outcomes producing the same sequence of 0s and 1s should contain the same information about $\phi$.2
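
Here’s a small illustration of that point, using two hypothetical sets of outcome values: the values differ, but the ordered pattern of group labels is the same, so the Mann-Whitney statistic is the same.

```python
import numpy as np
from scipy.stats import mannwhitneyu

x = np.array([0, 0, 1, 0, 1, 1])                    # group labels, in sorted-y order
y_a = np.array([1.2, 3.4, 3.5, 4.0, 7.7, 9.1])      # one set of outcomes (sorted)
y_b = np.array([-5.0, -4.9, 0.3, 2.2, 2.3, 100.0])  # different values, same ordering

for y in (y_a, y_b):
    u = mannwhitneyu(y[x == 1], y[x == 0]).statistic
    print(u)  # same statistic both times: only the 0/1 pattern matters
```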

Suppose I sample $(x_i, y_i)$ one at a time, with the $x_i$s being the same as before.

Clearly comparing $(x_1, y_1)$ to $(x_2, y_2)$ is informative. I use it to determine which of the following is the sequence of 0s and 1s.

$$(x_1 = 0)\,(x_2 = 1)$$

$$(x_2 = 1)\,(x_1 = 0)$$

Suppose $c(y_2, y_1) = 1$, so I have the first sequence above. Now I sample $(x_3, y_3)$ and look at $c(y_3, y_1)$. Suppose $c(y_3, y_1) = 1$. Then the possible sequences are

$$(x_1 = 0)\,(x_3 = 1)\,(x_2 = 1)$$

$$(x_1 = 0)\,(x_2 = 1)\,(x_3 = 1)$$

Then it doesn’t matter whether $y_3 > y_2$ or $y_2 > y_3$; it doesn’t change the pattern of 0s and 1s, so $c(y_3, y_2)$ isn’t informative.

On the other hand, if $c(y_3, y_1) = 0$, the only possibility is

$$(x_3 = 1)\,(x_1 = 0)\,(x_2 = 1)$$

in which case $c(y_3, y_2)$ isn’t informative because it must be equal to 0.

$i$ vs $j$    $x_i$ vs $x_j$    Is $c(y_i, y_j)$ informative for $\phi$?
2 vs 1        1 vs 0            Yes
3 vs 1        1 vs 0            Yes
3 vs 2        1 vs 1            No

When I sample $(x_4, y_4)$, the situation gets a little complicated. But I wrote some code to determine which combinations of $c(y_i, y_j)$ comparisons are sufficient to determine the Mann-Whitney statistic from 4 observations, and it showed that (perhaps not surprisingly)

  1. The minimum number of comparisons occurs when using all between-group comparisons (each observation having $x_i = 1$ compared with each having $x_j = 0$).
  2. Those between-group comparisons are always necessary; the Mann-Whitney statistic isn’t guaranteed to be completely determined by the data unless all of those comparisons are included in the data.

$i$ vs $j$    $x_i$ vs $x_j$    Is $c(y_i, y_j)$ informative for $\phi$?
2 vs 1        1 vs 0            Yes
3 vs 1        1 vs 0            Yes
3 vs 2        1 vs 1            No
4 vs 1        0 vs 0            No
4 vs 2        0 vs 1            Yes
4 vs 3        0 vs 1            Yes

Running similar code, I found the same was true when a 5th observation, having $x_5 = 0$, was added to the mix.
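
For reference, here is a brute-force sketch in the spirit of that check (not the original code): it enumerates all tie-free orderings of 4 observations with labels $x = (0, 1, 1, 0)$ and reports the smallest subsets of comparisons that always pin down the Mann-Whitney statistic.

```python
from itertools import combinations, permutations

x = (0, 1, 1, 0)                       # group labels for observations 1..4
n = len(x)
all_pairs = list(combinations(range(n), 2))

def mw_statistic(order):
    # order[i] is the rank of observation i; count group-1 vs group-0 "wins"
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if x[i] == 1 and x[j] == 0 and order[i] > order[j]
    )

def revealed(order, pairs):
    # the c(y_i, y_j) values (1 if y_i > y_j) for the chosen subset of pairs
    return tuple(int(order[i] > order[j]) for i, j in pairs)

def determines_mw(pairs):
    # sufficient iff orderings that agree on the chosen comparisons
    # always share the same Mann-Whitney statistic
    groups = {}
    for order in permutations(range(n)):  # all tie-free orderings
        groups.setdefault(revealed(order, pairs), set()).add(mw_statistic(order))
    return all(len(stats) == 1 for stats in groups.values())

sufficient = [
    pairs
    for r in range(len(all_pairs) + 1)
    for pairs in combinations(all_pairs, r)
    if determines_mw(pairs)
]
smallest = min(len(s) for s in sufficient)
print([s for s in sufficient if len(s) == smallest])
# Prints only the four between-group pairs, consistent with points 1 and 2 above.
```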

Part 3

Here’s another way to see mathematically why $E(Y_1 - Y_0)$ doesn’t require looking at all pairwise differences.

Suppose the outcomes for population 1 are $x_1, \dots, x_n$, and the outcomes for population 0 are $y_1, \dots, y_n, y_{n+1}, \dots, y_{n+k}$. That is, population 0 has more individuals. (Things would work similarly if population 1 had more.)

The mean difference between all pairs is

$$E(X - Y) = \frac{1}{n(n+k)} \sum_{i=1}^{n} \sum_{j=1}^{n+k} (x_i - y_j) = \frac{1}{n(n+k)} \left[ \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - y_j) + \sum_{i=1}^{n} \sum_{j=n+1}^{n+k} (x_i - y_j) \right]$$

How do some of these comparisons become redundant? Note that

$$x_i - y_j = (x_i - y_i) + (x_j - y_j) - (x_j - y_i)$$

so

$$(x_i - y_j) + (x_j - y_i) = (x_i - y_i) + (x_j - y_j)$$

As a result, in the sum over all pairwise differences among the first $n$ individuals in each population, $\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - y_j)$, the differences with $i \ne j$ can be substituted with $i = j$ differences, yielding

$$\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - y_j) = n \sum_{i=1}^{n} (x_i - y_i)$$

Thus, for comparing the first $n$ individuals in each population, I need only $x_1 - y_1$, …, $x_n - y_n$.

What remains is the $n \times k$ differences between $x_1, \dots, x_n$ and $y_{n+1}, \dots, y_{n+k}$: $\sum_{i=1}^{n} \sum_{j=n+1}^{n+k} (x_i - y_j)$. These too can be reduced to a smaller number of differences. Suppose I start by committing to use $x_n - y_{n+1}$, $x_n - y_{n+2}$, …, $x_n - y_{n+k}$ (each “extra” $y$ outcome compared to the last $x$ outcome). Then for the remaining differences, I need only $x_n - x_j$, for $j = 1, \dots, n-1$:

$$x_i - y_j = (x_n - y_j) + (x_i - x_n)$$

Then I’m using only $n + k - 1$ differences, rather than $n \times k$. This part actually can simplify in a different, interesting way:

$$\sum_{i=1}^{n} \sum_{j=n+1}^{n+k} (x_i - y_j) = k \sum_{i=1}^{n} x_i - n \sum_{j=n+1}^{n+k} y_j = kn\bar{x} - kn\bar{y}^{(n)}$$

where $\bar{y}^{(n)}$ indicates the mean of the “extra” $y$ values, i.e. after excluding the first $n$, which are matched with corresponding $x$ values.

The result is

$$E(X - Y) = \frac{1}{n(n+k)} \left[ n \sum_{i=1}^{n} (x_i - y_i) + kn(\bar{x} - \bar{y}^{(n)}) \right] = \frac{n}{n+k}\,\bar{d}_{(n)} + \frac{k}{n+k}\,(\bar{x} - \bar{y}^{(n)}) = \frac{n}{n+k}\,(\bar{x} - \bar{y}_{(n)}) + \frac{k}{n+k}\,(\bar{x} - \bar{y}^{(n)})$$

where $d_i = x_i - y_i$, $\bar{d}_{(n)}$ is the mean of $d_1, \dots, d_n$, and $\bar{y}_{(n)}$ is the mean of $y_1, \dots, y_n$. Thus, $E(X - Y)$ can be calculated as a weighted average, combining the average of $n$ “one-to-one” differences with an adjustment for the additional outcomes of the individuals in the larger population.
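
A quick numerical check of this decomposition, with made-up data (a sketch assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 8, 5
x = rng.normal(loc=1.0, size=n)       # outcomes for population 1
y = rng.normal(loc=0.0, size=n + k)   # outcomes for population 0 (n + k of them)

all_pair_mean = (x[:, None] - y[None, :]).mean()  # mean over all n * (n + k) differences

d_bar = (x - y[:n]).mean()            # mean of the n "one-to-one" differences
y_bar_extra = y[n:].mean()            # mean of the k "extra" y values
weighted = (n / (n + k)) * d_bar + (k / (n + k)) * (x.mean() - y_bar_extra)

print(all_pair_mean, weighted, x.mean() - y.mean())  # all three agree
```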

Conclusion

To summarize: estimation of $P(X > Y)$ requires all pairwise comparisons between the two samples, but estimation of $E(X - Y)$ requires only a subset of comparisons. One way of understanding the case of $E(X - Y)$ is that many of the comparisons are redundant. The richer information in each comparison, when those comparisons are differences in an interval-type variable (i.e. a variable for which differences are meaningfully defined), allows the relevant information about $E(X - Y)$ to be extracted from a smaller number of comparisons. Comparisons of the type $c(y_{1j}, y_{0i})$ provide too little information individually to allow the relevant information about $P(X > Y)$ to be extracted from a subset of comparisons.

A question I still have: whether looking at a simple two-sample comparison or at regression modeling based on $P(X > Y)$, do some of the pairwise comparisons provide much more information than others? For analysis of large samples, could the comparisons be pruned in some a priori way, to avoid the computational burden of $n(n-1)/2$ comparisons while losing little information?


  1. Thas, O., De Neve, J., Clement, L., and Ottoy, J. P. (2012), “Probabilistic index models,” Journal of the Royal Statistical Society: Series B, 74, 623–671.

  2. The idea of considering sequences of 0s and 1s comes from the paper describing the Mann-Whitney test: Mann, H. B., and Whitney, D. R. (1947), “On a test of whether one of two random variables is stochastically larger than the other,” Annals of Mathematical Statistics, 18(1), 50–60.