6 min read

Subgroup analysis in R

Sometimes a statistical analysis is to be repeated in subgroups defined by one or more variables. This is particularly common in clinical trials, where one component of understanding the effect of the investigational treatment may be understanding whether the effect is substantially stronger in particular groups of patients.

Recently I needed to repeat an analysis in subgroups defined by each of about 12 variables. In R, I used nested for loops to accomplish the task, but I suspected there was a more concise way using tidyverse functions. Here’s what I found.

Setup

Here are some fake data I’ll use in the following code examples. Consider x1x3 to be grouping variables, and y an outcome variable.

library(dplyr) # for tidy data manipulation

n <- 1000
tbl <-
  tibble(x1 = sample(1:2, n, replace = TRUE),
         x2 = sample(1:3, n, replace = TRUE),
         x3 = sample(1:4, n, replace = TRUE),
         y = rnorm(n)) %>%
  mutate(across(x1:x3, factor))
head(tbl)
## # A tibble: 6 × 4
##   x1    x2    x3            y
##   <fct> <fct> <fct>     <dbl>
## 1 2     2     1      0.240   
## 2 2     1     2      0.889   
## 3 1     2     3     -0.000947
## 4 1     3     1     -1.22    
## 5 1     1     4     -0.300   
## 6 1     2     4      1.20

One subgroup analysis

First, to illustrate the problem, suppose we want to repeat an analysis in subgroups defined by only one variable, x1. Using a for loop might look like this:

library(broom) # for tidy model results

# The analysis to be done in each subgroup.
# Returns, e.g., point and interval estimates
# for a parameter.
do_analysis <- function (data) {
  lm(y ~ 1, data = data) %>%
    tidy(conf.int = TRUE) %>%
    select(estimate, conf.low, conf.high)
}

output <- NULL
for (lvl in levels(tbl$x1)) {
  tmp <-
    tbl %>%
    filter(x1 %in% lvl)
  output <-
    output %>%
    bind_rows(do_analysis(tmp) %>%
                mutate(x1 = lvl,
                       .before = 1))
}
output
## # A tibble: 2 × 4
##   x1    estimate conf.low conf.high
##   <chr>    <dbl>    <dbl>     <dbl>
## 1 1      0.00524  -0.0807    0.0912
## 2 2     -0.0716   -0.160     0.0165

The same task can be accomplished in a slightly more concise way, in one chain of “piped” operations, using functions from the tidyr and purrr packages:

library(purrr) # for mapping
library(tidyr) # for nesting

tbl %>%
  # For each unique value of x1, collapse the
  # remaining data into a data frame and put
  # these data frames in a column named 'data'
  nest(data = -x1) %>%
  # Perform analysis on each data frame
  mutate(est = map(data, do_analysis)) %>%
  # Unpack the analysis results
  unnest(cols = est) %>%
  select(-data)
## # A tibble: 2 × 4
##   x1    estimate conf.low conf.high
##   <fct>    <dbl>    <dbl>     <dbl>
## 1 2     -0.0716   -0.160     0.0165
## 2 1      0.00524  -0.0807    0.0912

Multiple subgroup analyses

Similarly, when analysis needs to be repeated across subgroups defined by multiple variables, one option is to use nested for loops:

output <- NULL
subgroup.vars <- c("x1", "x2", "x3")
for (var in subgroup.vars) {
  for (lvl in levels(tbl[[var]])) {
    tmp <- tbl[tbl[[var]] %in% lvl, ]
    output <-
      output %>%
      bind_rows(do_analysis(tmp) %>%
                  mutate(var = var,
                         lvl = lvl,
                         .before = 1))
  }
}
output
## # A tibble: 9 × 5
##   var   lvl   estimate conf.low conf.high
##   <chr> <chr>    <dbl>    <dbl>     <dbl>
## 1 x1    1      0.00524  -0.0807    0.0912
## 2 x1    2     -0.0716   -0.160     0.0165
## 3 x2    1     -0.0746   -0.179     0.0294
## 4 x2    2      0.0405   -0.0693    0.150 
## 5 x2    3     -0.0548   -0.161     0.0518
## 6 x3    1     -0.0360   -0.164     0.0919
## 7 x3    2     -0.0950   -0.217     0.0272
## 8 x3    3     -0.0240   -0.151     0.103 
## 9 x3    4      0.0253   -0.0902    0.141

Alternatively, one can use the nest and map functions, but this time it’s a little tricky. First, I create a table of the subgroup variables and their levels:

tmp <-
  tibble(var = subgroup.vars) %>%
  mutate(lvl = map(var, ~ levels(tbl[[.x]]))) %>%
  unnest(cols = lvl)
tmp
## # A tibble: 9 × 2
##   var   lvl  
##   <chr> <chr>
## 1 x1    1    
## 2 x1    2    
## 3 x2    1    
## 4 x2    2    
## 5 x2    3    
## 6 x3    1    
## 7 x3    2    
## 8 x3    3    
## 9 x3    4

Then I add the data sets corresponding to each subgroup. In the resulting table, the original data set has been replicated (while also being split by subgroups). Each original observation appears g times, where g is the number of subgroup variables.

tmp %>%
  mutate(data =
           map2(var, lvl,
                ~ tbl[tbl[[.x]] %in% .y, ]))
## # A tibble: 9 × 3
##   var   lvl   data              
##   <chr> <chr> <list>            
## 1 x1    1     <tibble [521 × 4]>
## 2 x1    2     <tibble [479 × 4]>
## 3 x2    1     <tibble [342 × 4]>
## 4 x2    2     <tibble [315 × 4]>
## 5 x2    3     <tibble [343 × 4]>
## 6 x3    1     <tibble [232 × 4]>
## 7 x3    2     <tibble [243 × 4]>
## 8 x3    3     <tibble [272 × 4]>
## 9 x3    4     <tibble [253 × 4]>

Finally, we perform analysis on each data set and unpack the results. Altogether the operation looks like this:

tibble(var = subgroup.vars) %>%
  mutate(lvl = map(var, ~ levels(tbl[[.x]]))) %>%
  unnest(cols = lvl) %>%
  mutate(data =
           map2(var, lvl,
                ~ tbl[tbl[[.x]] %in% .y, ])) %>%
  mutate(est = map(data, do_analysis)) %>%
  unnest(cols = est) %>%
  select(-data)
## # A tibble: 9 × 5
##   var   lvl   estimate conf.low conf.high
##   <chr> <chr>    <dbl>    <dbl>     <dbl>
## 1 x1    1      0.00524  -0.0807    0.0912
## 2 x1    2     -0.0716   -0.160     0.0165
## 3 x2    1     -0.0746   -0.179     0.0294
## 4 x2    2      0.0405   -0.0693    0.150 
## 5 x2    3     -0.0548   -0.161     0.0518
## 6 x3    1     -0.0360   -0.164     0.0919
## 7 x3    2     -0.0950   -0.217     0.0272
## 8 x3    3     -0.0240   -0.151     0.103 
## 9 x3    4      0.0253   -0.0902    0.141

Is this method really the more concise alternative that I was looking for? Not exactly; it’s similar in the amount of code, compared to the nested for loops. Although I like having one chain of operations — and no intermediate/temporary objects — in the style of the tidyverse, this method is less efficient in a way (by replicating the data), and the code might be more difficult to understand than the nested for loops.

Conclusion

For a single subgroup analysis, I prefer the more concise method using the tidyverse’s nest-map-unnest functions.

For multiple subgroup variables, it’s possible to use tidyverse functions in a single chain of operations, but it’s less clear whether this method is an improvement over for loops.