I’ve been reading about probabilistic index models (PIMs).[^1] This is a type of regression model that assesses the dependence of an outcome on covariates (or group membership) based on the “probabilistic index,” $P(Y \preceq Y^{*} \mid X, X^{*})$, where $(X, Y)$ and $(X^{*}, Y^{*})$ are independent draws from the joint distribution of covariates and outcome.
An interesting thing about these models is that for a sample of size $n$, the model fitting procedure (unless one customizes it) uses all $n(n-1)/2$ pairwise comparisons of the covariates and outcomes. If your data set looks like
| x | y |
|---|---|
| 1 | -0.47 |
| 2 | -0.14 |
| 3 | -1.61 |
| 4 | -1.37 |
then the model is fit to a data set like
| x1 | y1 | x2 | y2 |
|---|---|---|---|
| 1 | -0.47 | 2 | -0.14 |
| 1 | -0.47 | 3 | -1.61 |
| 1 | -0.47 | 4 | -1.37 |
| 2 | -0.14 | 3 | -1.61 |
| 2 | -0.14 | 4 | -1.37 |
| 3 | -1.61 | 4 | -1.37 |
where each row in the original data is compared with each other row.
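Concretely, the expansion looks like this in code (a minimal sketch in Python/pandas, not taken from any PIM implementation):

```python
import itertools

import pandas as pd

# Original data: one row per observation.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [-0.47, -0.14, -1.61, -1.37]})

# Pseudo-observations: one row per unordered pair of original rows.
pairs = pd.DataFrame(
    [
        {"x1": df.x[i], "y1": df.y[i], "x2": df.x[j], "y2": df.y[j]}
        for i, j in itertools.combinations(df.index, 2)
    ]
)
print(pairs)  # 6 rows: n(n-1)/2 pairs for n = 4
```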
This is a problem when $n$ is large. Fitting the model can be cumbersome or just infeasible. Why does this model require all pairwise comparisons, whereas other regression models don’t?
One answer might be: a PIM is inherently a model of an outcome resulting from pairwise comparisons, whereas a linear regression model or GLM is inherently a model of an individual’s outcome.
But suppose one formed a regression model for $E(Y_j - Y_i \mid X_i, X_j)$, where $X_i$ represents covariates associated with $Y_i$, and $X_j$ contains covariates for $Y_j$. This would be a model for pairwise comparisons of outcomes, yet the coefficients $\beta$ in

$$E(Y_j - Y_i \mid X_i, X_j) = (X_j - X_i)^{\mathsf{T}} \beta$$

are exactly the same as those in the linear regression model

$$E(Y_i \mid X_i) = \alpha + X_i^{\mathsf{T}} \beta,$$

which can be fit using just the $n$ observations; still no need for the $n(n-1)/2$ pairs.
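Here’s a quick numerical check of that claim with simulated data (a sketch; the pairwise fit drops the intercept because differencing cancels it):

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Ordinary regression E(Y | X) = alpha + beta * X, fit to the n observations.
beta_ols = np.polyfit(x, y, 1)[0]

# Pairwise-difference regression E(Yj - Yi | Xi, Xj) = beta * (Xj - Xi),
# fit to all n(n-1)/2 pairs; no intercept, since differencing cancels it.
idx = list(itertools.combinations(range(n), 2))
dx = np.array([x[j] - x[i] for i, j in idx])
dy = np.array([y[j] - y[i] for i, j in idx])
beta_pairs = np.sum(dx * dy) / np.sum(dx * dx)  # least squares through the origin

print(beta_ols, beta_pairs)  # identical up to floating-point error
```

The two printed slopes match exactly (up to floating-point error), because the pairwise least-squares equations reduce algebraically to the ordinary ones.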
## Two-sample comparisons
What about a simpler setting, comparing an outcome variable between two populations with no consideration of covariates?
Suppose I independently sample $n_1$ individuals from one population and $n_0$ individuals from another. For each individual, I observe outcome $Y$. Assume $Y$ is continuous. I’ll use $Y_1$ and $Y_0$ to denote random outcomes from the two populations, sampled independently.
To make a statistical comparison between the two populations, the conventional options are
- A t-test, which can be considered a parametric test about the value of $E(Y_1) - E(Y_0)$, assuming $Y_1$ and $Y_0$ are each normally distributed (though it works as a test about $E(Y_1) - E(Y_0)$ in many non-normal settings).
- A Wilcoxon (Mann-Whitney) test, which can be considered a non-parametric test about the value of $P(Y_1 > Y_0)$ (plus $\frac{1}{2} P(Y_1 = Y_0)$ if ties are possible).
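For example, with simulated data, both options are available in `scipy.stats`, and dividing the Mann-Whitney statistic by $n_1 n_0$ gives the sample estimate of $P(Y_1 > Y_0)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y1 = rng.normal(loc=0.5, scale=1.0, size=30)  # sample from population 1
y0 = rng.normal(loc=0.0, scale=1.0, size=20)  # sample from population 0

# Option 1: t-test about E(Y1) - E(Y0).
print(stats.ttest_ind(y1, y0))

# Option 2: Mann-Whitney test; its U statistic counts the pairs with
# y1 > y0, so U / (n1 * n0) is the sample estimate of P(Y1 > Y0).
u, p = stats.mannwhitneyu(y1, y0)
print(u / (len(y1) * len(y0)), p)
```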
I’ve always thought of option #1 (the t-test) as first summarizing each population with its mean (expected value), then comparing the means. Another way to think about it, which I didn’t consider until recently, is that you imagine sampling a random pair of outcomes, $(Y_1, Y_0)$, compare them by taking the difference $Y_1 - Y_0$, and look at the mean over all such comparisons, $E(Y_1 - Y_0)$.
Of course, $E(Y_1 - Y_0)$ is mathematically the same as $E(Y_1) - E(Y_0)$. The reason for making the distinction is to set up an alternative comparison based on $P(Y_1 > Y_0)$.
For the Mann-Whitney test statistic, or for the sample estimate of $P(Y_1 > Y_0)$, the computation essentially involves comparing each value of $Y$ in one sample to each value of $Y$ in the other. The M-W statistic is the number of pairs for which $Y_{1i} > Y_{0j}$ (I’m assuming no ties). The sample estimate of $P(Y_1 > Y_0)$ is the mean of

$$C_{ij} = I(Y_{1i} > Y_{0j})$$

over all pairs. Thus, there’s a natural connection between the concept of $P(Y_1 > Y_0)$ as a mean over all pairwise comparisons between the two populations, and the actual computation for inference about $P(Y_1 > Y_0)$ from a sample.
In the case of $E(Y_1 - Y_0)$, the computation doesn’t require looking at all $n_1 n_0$ differences $Y_{1i} - Y_{0j}$. Primarily, the information about $E(Y_1 - Y_0)$ boils down to the sample means, $\bar{Y}_1 - \bar{Y}_0$. Suppose $n_1 = n_0 = n$. Then

$$\bar{Y}_1 - \bar{Y}_0 = \frac{1}{n} \sum_{i=1}^{n} (Y_{1i} - Y_{0i}),$$

which involves only $n$ of the $n^2$ differences. A similar simplification is possible when $n_1 \neq n_0$ (Part 3 below works it out). Either way, it seems that the number of arithmetic comparisons needed to extract all the information about $E(Y_1 - Y_0)$ from the sample will be roughly proportional to $n_1 + n_0$, not $n_1 n_0$.
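A quick numerical illustration of the asymmetry (simulated data; the indicator estimate materializes all $n_1 n_0$ comparisons, while the mean difference never needs them):

```python
import numpy as np

rng = np.random.default_rng(1)
y1 = rng.normal(0.5, 1.0, size=500)  # sample from population 1
y0 = rng.normal(0.0, 1.0, size=400)  # sample from population 0

# Estimate of P(Y1 > Y0): mean of the indicator over all n1 * n0 pairs.
p_hat = np.mean(y1[:, None] > y0[None, :])

# Estimate of E(Y1 - Y0): a difference of sample means, O(n1 + n0) work.
mean_diff = y1.mean() - y0.mean()

print(p_hat, mean_diff)
```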
## Looking closer
Although this is starting to sound like a matter of algorithm analysis and big-O classification of the calculations for the two-sample mean comparison versus the two-sample Mann-Whitney comparison, I’m not just interested in the computational steps needed for specific estimates or test statistics. I’m trying to understand at an intuitive level why one population comparison requires more pairwise comparisons than the other.
Here’s one attempt:
Suppose that $Y$ is the variable I want to compare, and $D$ indicates whether an individual belongs to one population ($D = 1$) or the other ($D = 0$).
Suppose I sample one individual at a time and observe $(D_i, Y_i)$. With each individual, I consider which comparisons between it and other, previously sampled individuals are informative about $E(Y_1 - Y_0)$ and $P(Y_1 > Y_0)$, where $Y_1$ and $Y_0$ are independently sampled from the populations with $D = 1$ and $D = 0$, respectively.
When $E(Y_1 - Y_0)$ is of interest, I’ll consider the $Y_i$ vs $Y_j$ comparison and the difference $Y_i - Y_j$ as potential input for an algorithm that will estimate $E(Y_1 - Y_0)$ in some kind of optimal way, and I’ll ask: do I expect that adding this input could potentially change the estimate?
On the other hand, if $P(Y_1 > Y_0)$ is of interest, my intuition tells me that, whatever intermediate computations are done, when it comes to actually estimating $P(Y_1 > Y_0)$, the only information from each pair that the algorithm will incorporate into the estimate is whether $Y_i > Y_j$. (Why? I don’t know; just intuition.) So when $P(Y_1 > Y_0)$ is of interest, I’ll compute $C_{ij} = I(Y_i > Y_j)$ (the indicator defined above), and I’ll ask whether each $C_{ij}$ should be added to the input for an algorithm that estimates $P(Y_1 > Y_0)$.
## Part 1
Start with a setting where I’m estimating $E(Y_1 - Y_0)$. Sample $(D_1, Y_1)$, then $(D_2, Y_2)$. Let’s assume $D_1 = 0$ and $D_2 = 1$. Obviously the comparison of these two observations is informative for $E(Y_1 - Y_0)$.
| $i$ vs $j$ | $D_i$ vs $D_j$ | Is $Y_i - Y_j$ informative for $E(Y_1 - Y_0)$? |
|---|---|---|
| 2 vs 1 | 1 vs 0 | Yes |
Next I sample $(D_3, Y_3)$. Now consider how $Y_1$ and $Y_2$ each compares to $Y_3$. Let’s say $D_3 = 1$. Since $Y_3$ is new information about population 1, $Y_3 - Y_1$ seems clearly informative for $E(Y_1 - Y_0)$. Let’s add it to the input for the algorithm. But $Y_3 - Y_2$? Since $Y_3 - Y_2 = (Y_3 - Y_1) - (Y_2 - Y_1)$, the information that $Y_3 - Y_2$ represents is already in the rest of the data. There shouldn’t be any reason to provide that value.
| $i$ vs $j$ | $D_i$ vs $D_j$ | Is $Y_i - Y_j$ informative for $E(Y_1 - Y_0)$? |
|---|---|---|
| 2 vs 1 | 1 vs 0 | Yes |
| 3 vs 1 | 1 vs 0 | Yes |
| 3 vs 2 | 1 vs 1 | No |
Continue with sampling $(D_4, Y_4)$, and let’s say $D_4 = 0$. The difference $Y_4 - Y_1$ is a difference within group 0, which could affect the estimate of $E(Y_0)$ and thereby affect the estimate of $E(Y_1 - Y_0)$, so I include it in the input. However, $Y_4 - Y_2 = (Y_4 - Y_1) - (Y_2 - Y_1)$ and $Y_4 - Y_3 = (Y_4 - Y_1) - (Y_3 - Y_1)$, so those are redundant.
| $i$ vs $j$ | $D_i$ vs $D_j$ | Is $Y_i - Y_j$ informative for $E(Y_1 - Y_0)$? |
|---|---|---|
| 2 vs 1 | 1 vs 0 | Yes |
| 3 vs 1 | 1 vs 0 | Yes |
| 3 vs 2 | 1 vs 1 | No |
| 4 vs 1 | 0 vs 0 | Yes |
| 4 vs 2 | 0 vs 1 | No |
| 4 vs 3 | 0 vs 1 | No |
For each subsequent sample, all that the algorithm would need to know is the difference between the new $Y_i$ and $Y_1$. All other differences are redundant. Therefore, after $n$ samples, there will be $n - 1$ informative differences; in other words, $n - 1$ degrees of freedom.
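Here’s a small check of that redundancy claim (my own sketch, with hypothetical labels): all pairwise differences, and the mean-difference estimate itself, can be reconstructed from the $n - 1$ differences against the first observation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = np.array([0, 1, 1, 0, 1, 0])      # group labels, in sampling order
y = rng.normal(loc=d, scale=1.0)      # one outcome per individual

# The n - 1 "informative" differences: each later observation vs observation 1.
base = y[1:] - y[0]

# Any other difference is recoverable: Y_i - Y_j = (Y_i - Y_1) - (Y_j - Y_1).
assert np.isclose(y[3] - y[2], base[2] - base[1])

# The mean-difference estimate also needs only these n - 1 numbers (plus the
# labels): writing Y_i = Y_1 + base_i, the unknown Y_1 cancels in the estimate.
shifted = np.append(0.0, base)        # equals y - y[0]
est = shifted[d == 1].mean() - shifted[d == 0].mean()
print(est, y[d == 1].mean() - y[d == 0].mean())  # identical
```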
## Part 2
What if I’m interested in $P(Y_1 > Y_0)$ instead?
My intuition here is that what matters for $P(Y_1 > Y_0)$ is the order of the $Y$ values. This might be obvious. The Wilcoxon test, which is based on ranking the combined sample, clearly returns the same result as long as the order of the outcome values is the same.
More specifically, what matters is where outcomes from group $D = 0$ and group $D = 1$ appear in the ordered sequence of outcomes. Suppose I sort the observations so that $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n)}$. Then I look at the corresponding labels $D_{(1)}, D_{(2)}, \ldots, D_{(n)}$, which form a sequence of 0s and 1s. What matters for estimating $P(Y_1 > Y_0)$ is that there are, say, three 0s, then one 1, then two 0s, etc., in the sequence of $D$s. Any other set of outcomes producing the same sequence of 0s and 1s should contain the same information about $P(Y_1 > Y_0)$.[^2]
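To make this concrete, here’s a small illustration (mine, not from the Mann-Whitney paper): the Mann-Whitney count can be computed from the sorted label sequence alone, and it agrees with the direct all-pairs count whatever the underlying $Y$ values are.

```python
import numpy as np

def mw_from_labels(d_sorted):
    """Mann-Whitney count of (group-1, group-0) pairs with Y1 > Y0,
    computed from the group labels in increasing order of Y alone."""
    d_sorted = np.asarray(d_sorted)
    # Each 1 in the sorted sequence beats exactly the 0s that precede it.
    return int(np.cumsum(d_sorted == 0)[d_sorted == 1].sum())

# Any outcomes with this sorted label pattern give the same statistic.
y = np.array([-2.0, 0.3, 1.1, 1.5, -0.4, 0.9, 8.2])
d = np.array([0, 1, 0, 1, 0, 0, 0])
u_labels = mw_from_labels(d[np.argsort(y)])
u_pairs = int(np.sum(y[d == 1][:, None] > y[d == 0][None, :]))
assert u_labels == u_pairs
print(u_labels)
```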
Suppose I sample one at a time, with the $D$s being the same as before ($D_1 = 0$, $D_2 = 1$, $D_3 = 1$, $D_4 = 0$).
Clearly comparing $Y_2$ to $Y_1$ is informative. I use it to determine which of the following is the sequence of 0s and 1s:

$$0\;1 \quad (\text{if } Y_1 < Y_2) \qquad \text{or} \qquad 1\;0 \quad (\text{if } Y_2 < Y_1).$$
Suppose $Y_1 < Y_2$, so I have the first sequence above. Now I sample $(D_3, Y_3)$ and look at $C_{31} = I(Y_3 > Y_1)$. Suppose $Y_3 > Y_1$. Then the possible sequences are

$$0\;1\;1 \quad (\text{if } Y_1 < Y_3 < Y_2) \qquad \text{and} \qquad 0\;1\;1 \quad (\text{if } Y_1 < Y_2 < Y_3).$$

Then it doesn’t matter whether $Y_3 > Y_2$ or $Y_3 < Y_2$; it doesn’t change the pattern of 0s and 1s, so $C_{32}$ isn’t informative.
On the other hand, if $Y_3 < Y_1$, the only possibility is

$$1\;0\;1 \quad (Y_3 < Y_1 < Y_2),$$

in which case $C_{32}$ isn’t informative because it must be equal to 0.
| $i$ vs $j$ | $D_i$ vs $D_j$ | Is $C_{ij}$ informative for $P(Y_1 > Y_0)$? |
|---|---|---|
| 2 vs 1 | 1 vs 0 | Yes |
| 3 vs 1 | 1 vs 0 | Yes |
| 3 vs 2 | 1 vs 1 | No |
When I sample $(D_4, Y_4)$, the situation gets a little complicated. But I wrote some code to determine which combinations of comparisons are sufficient to determine the Mann-Whitney statistic from 4 observations (a sketch of the idea follows the table below), and it showed that (perhaps not surprisingly)
- The minimum number of comparisons occurs when using all between-group comparisons (each observation having $D = 1$ compared with each having $D = 0$).
- Those between-group comparisons are always necessary; the Mann-Whitney statistic isn’t guaranteed to be completely determined by the data unless all of those comparisons are included in the data.
| $i$ vs $j$ | $D_i$ vs $D_j$ | Is $C_{ij}$ informative for $P(Y_1 > Y_0)$? |
|---|---|---|
| 2 vs 1 | 1 vs 0 | Yes |
| 3 vs 1 | 1 vs 0 | Yes |
| 3 vs 2 | 1 vs 1 | No |
| 4 vs 1 | 0 vs 0 | No |
| 4 vs 2 | 0 vs 1 | Yes |
| 4 vs 3 | 0 vs 1 | Yes |
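In outline, the search can be done by brute force, along these lines (a minimal sketch of the idea, not the exact code): enumerate subsets of pairwise comparisons and check whether the comparison results pin down the Mann-Whitney statistic across all orderings of 4 distinct outcomes.

```python
from itertools import combinations, permutations

d = [0, 1, 1, 0]                      # group labels, in sampling order
n = len(d)
all_pairs = list(combinations(range(n), 2))

def u_stat(rank):
    """Mann-Whitney statistic: count of (i, j) with D_i = 1, D_j = 0, Y_i > Y_j,
    where rank[k] gives the rank of Y_k among the n outcomes (no ties)."""
    return sum(
        rank[i] > rank[j]
        for i in range(n)
        for j in range(n)
        if d[i] == 1 and d[j] == 0
    )

def determines_u(pairs):
    """Do the comparison results I(Y_i > Y_j), for just these pairs,
    pin down the statistic for every possible ordering of the outcomes?"""
    seen = {}
    for rank in permutations(range(n)):           # every ordering of n values
        sig = tuple(rank[i] > rank[j] for i, j in pairs)
        if seen.setdefault(sig, u_stat(rank)) != u_stat(rank):
            return False                          # same inputs, different U
    return True

# Find the smallest sufficient subsets of the 6 possible comparisons.
for size in range(len(all_pairs) + 1):
    sufficient = [s for s in combinations(all_pairs, size) if determines_u(s)]
    if sufficient:
        print(size, sufficient)                   # the 4 between-group pairs
        break
```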
Running similar code, the same was true when a 5th observation was added to the mix.
## Part 3
Here’s another way to see mathematically why estimating $E(Y_1 - Y_0)$ doesn’t require looking at all pairwise differences.
Suppose the outcomes for population 1 are $Y_{11}, \ldots, Y_{1 n_1}$, and the outcomes for population 0 are $Y_{01}, \ldots, Y_{0 n_0}$, with $n_1 < n_0$. That is, population 0 has more individuals. (Things would work similarly if population 1 had more.)
The mean difference between all pairs is

$$\frac{1}{n_1 n_0} \sum_{i=1}^{n_1} \sum_{j=1}^{n_0} (Y_{1i} - Y_{0j}).$$
How do some of these comparisons become redundant? Note that

$$(Y_{1i} - Y_{0j}) + (Y_{1j} - Y_{0i}) = (Y_{1i} - Y_{0i}) + (Y_{1j} - Y_{0j}),$$

so each pair of “crossed” differences contributes the same total as the corresponding pair of “matched” differences. As a result, in the sum over all pairwise differences among the first $n_1$ individuals in each population, $\sum_{i=1}^{n_1} \sum_{j=1}^{n_1} (Y_{1i} - Y_{0j})$, the differences with $i \neq j$ can be substituted with $i = j$ differences, yielding

$$\sum_{i=1}^{n_1} \sum_{j=1}^{n_1} (Y_{1i} - Y_{0j}) = n_1 \sum_{i=1}^{n_1} (Y_{1i} - Y_{0i}).$$
Thus, for comparing the first $n_1$ individuals in each population, I need only $Y_{11} - Y_{01}$, …, $Y_{1 n_1} - Y_{0 n_1}$.
What remains is the differences between $Y_{11}, \ldots, Y_{1 n_1}$ and $Y_{0, n_1 + 1}, \ldots, Y_{0 n_0}$: $\sum_{i=1}^{n_1} \sum_{j = n_1 + 1}^{n_0} (Y_{1i} - Y_{0j})$. These too can be reduced to a smaller number of differences. Suppose I start by committing to use $Y_{0, n_1 + 1} - Y_{0 n_1}$, $Y_{0, n_1 + 2} - Y_{0 n_1}$, …, $Y_{0 n_0} - Y_{0 n_1}$ (each “extra” outcome compared to the last matched outcome). Then for the remaining differences, I need only $Y_{1i} - Y_{0 n_1}$, for $i = 1, \ldots, n_1 - 1$ (the difference with $i = n_1$ is already among the matched differences):

$$\sum_{i=1}^{n_1} \sum_{j = n_1 + 1}^{n_0} (Y_{1i} - Y_{0j}) = \sum_{i=1}^{n_1} \sum_{j = n_1 + 1}^{n_0} \left[ (Y_{1i} - Y_{0 n_1}) - (Y_{0j} - Y_{0 n_1}) \right].$$
Then I’m using only $n_1 + n_0 - 1$ differences, rather than $n_1 n_0$. This part actually can simplify in a different, interesting way:

$$\sum_{i=1}^{n_1} \sum_{j = n_1 + 1}^{n_0} (Y_{1i} - Y_{0j}) = (n_0 - n_1) \sum_{i=1}^{n_1} \left( Y_{1i} - \bar{Y}_0^{\text{extra}} \right),$$

where $\bar{Y}_0^{\text{extra}} = \frac{1}{n_0 - n_1} \sum_{j = n_1 + 1}^{n_0} Y_{0j}$ indicates the mean of the “extra” $Y_0$ values, i.e. after excluding the first $n_1$, which are matched with corresponding $Y_1$ values.
The result is

$$\frac{1}{n_1 n_0} \sum_{i=1}^{n_1} \sum_{j=1}^{n_0} (Y_{1i} - Y_{0j}) = \frac{n_1}{n_0} \bar{D} + \frac{n_0 - n_1}{n_0} \left( \bar{Y}_1 - \bar{Y}_0^{\text{extra}} \right),$$

where $\bar{D} = \frac{1}{n_1} \sum_{i=1}^{n_1} (Y_{1i} - Y_{0i})$, $\bar{Y}_1$ is the mean of $Y_{11}, \ldots, Y_{1 n_1}$, and $\bar{Y}_0^{\text{extra}}$ is the mean of $Y_{0, n_1 + 1}, \ldots, Y_{0 n_0}$. Thus, $\bar{Y}_1 - \bar{Y}_0$ can be calculated as a weighted average, combining the average of “one-to-one” differences with an adjustment for the additional outcomes of the $n_0 - n_1$ individuals in the larger population.
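A quick numerical check of this decomposition (simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n0 = 5, 8
y1 = rng.normal(0.5, 1.0, size=n1)
y0 = rng.normal(0.0, 1.0, size=n0)

# Mean over all n1 * n0 pairwise differences (the "expensive" version)...
all_pairs = (y1[:, None] - y0[None, :]).mean()

# ...equals the weighted combination of the matched differences and the
# adjustment for the n0 - n1 "extra" outcomes in the larger sample.
d_bar = (y1 - y0[:n1]).mean()      # mean of the one-to-one differences
y0_extra = y0[n1:].mean()          # mean of the extra Y0 values
weighted = (n1 / n0) * d_bar + ((n0 - n1) / n0) * (y1.mean() - y0_extra)

print(all_pairs, weighted, y1.mean() - y0.mean())  # all three agree
```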
## Conclusion
To summarize: Estimation of $P(Y_1 > Y_0)$ requires all pairwise comparisons between the two samples, but estimation of $E(Y_1 - Y_0)$ requires only a subset of comparisons. One way of understanding the case of $E(Y_1 - Y_0)$ is that many of the comparisons are redundant. The richer information in each comparison, when those comparisons are differences in an interval-type variable (i.e. a variable for which differences are meaningfully defined), allows for extracting the relevant information about $E(Y_1 - Y_0)$ from a smaller number of comparisons. Comparisons of the type $I(Y_i > Y_j)$ provide too little information individually to extract the relevant information about $P(Y_1 > Y_0)$ from a subset of comparisons.
A question I still have: whether looking at a simple two-sample comparison or at regression modeling around $P(Y \preceq Y^{*} \mid X, X^{*})$, do some of the pairwise comparisons provide much more information than others? For analysis of large samples, could the comparisons be pruned in some a priori way, to avoid the computational burden of $O(n^2)$ comparisons while losing little information?
[^1]: Thas, O., De Neve, J., Clement, L., and Ottoy, J. P. (2012), “Probabilistic index models,” *Journal of the Royal Statistical Society: Series B*, 74, 623–671.

[^2]: The idea of considering sequences of 0s and 1s comes from the paper describing the Mann-Whitney test: Mann, H. B., and Whitney, D. R. (1947), “On a test of whether one of two random variables is stochastically larger than the other,” *Annals of Mathematical Statistics*, 18(1), 50–60.